HK1232312B

HK1232312B - Systems and methods for predicting a smoking status of an individual

Info

Publication number: HK1232312B
Application number: HK17105719.0A
Authority: HK
Inventors: F. Martin; M. Talikka
Original assignee: Philip Morris Products S.A.
Priority date: 2013-12-16
Filing date: 2014-12-11
Publication date: 2021-04-01

Description

System and method for predicting smoking status of an individual

Citation of Related Applications

This application claims priority under 35 U.S.C. §119 to U.S. Provisional Patent Application No. 61/916,443, filed December 16, 2013, entitled “Systems and Methods for Predicting a Smoking Status of an Individual,” which is incorporated herein by reference in its entirety.

Background Art

Whole-genome microarrays are used as a practical means of measuring genome-wide expression levels and gaining biological insight into a variety of conditions. This approach has also been used to assess the body's response to exposure to active substances and to predict the resulting phenotype. Molecular alterations in the transcriptome of large-airway cells from smokers, arising in response to cigarette smoke exposure, can be detected even in the absence of overt histological abnormalities. This observation suggests that transcriptomic data may be useful for assessing the response of biological systems to exposure to a variety of substances.

In many product risk assessment studies, obtaining samples from the primary site of interest (e.g., the airway) is invasive and inconvenient. Peripheral blood sampling, by contrast, is minimally invasive and widely used in the general population. Interest has therefore focused on identifying and establishing biomarkers that can be used reliably in peripheral blood, which serves as a surrogate tissue.

Early attempts to discover molecular biomarkers focused on identifying genes that are differentially expressed between case and control populations. More recent methods increasingly aim to predict new cases, with the goal of enhancing diagnosis, improving prognosis, and advancing personalized medicine. However, developing computational methods that are robust and general enough for clinical application remains challenging. For smoking-related diseases, diagnostic signatures have been identified in peripheral blood samples. At least two studies have shown that differentially expressed genes can distinguish subjects with early-stage non-small cell lung cancer from control subjects or from subjects with non-malignant lung disease (Rotunno, M., Hu, N., Su, H., Wang, C., Goldstein, A.M., Bergen, A.W., Consonni, D., Pesatori, A.C., Bertazzi, P.A., Wacholder, S., et al. (2011). A gene expression signature from peripheral whole blood for stage I lung adenocarcinoma. Cancer Prev Res (Phila) 4, 1599-1608; Showe, M.K., Vachani, A., Kossenkov, A.V., Yousef, M., Nichols, C., Nikonova, E.V., Chang, C., Kucharczuk, J., Tran, B., Wakeam, E., et al. (2009). Gene expression profiles in peripheral blood mononuclear cells can distinguish patients with non-small cell lung cancer from patients with nonmalignant lung disease. Cancer Res 69, 9202-9210).

Summary of the Invention

Computational systems and methods are provided for identifying a robust blood-based gene signature that can be used to predict an individual's smoker status. The gene signature described herein can distinguish subjects who currently smoke from subjects who have never smoked or who have quit smoking, thereby enabling accurate prediction of an individual's smoker status.

In certain aspects, the systems and methods of the present disclosure provide a computerized method for assessing a sample obtained from a subject. The computerized method includes receiving, by receiving circuitry, a data set associated with the sample, the data set comprising quantitative expression data for LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17. A processor generates a score based on the received data set, the score indicating a predicted smoking status of the subject. The predicted smoking status may classify the subject as a current smoker or a non-current smoker.

In certain implementations, the data set further comprises quantitative expression data for IGJ, RRM2, SERPING1, FUCA1, and ID3. In certain implementations, the score is the result of a classification scheme applied to the data set, wherein the classification scheme is determined based on the quantitative expression data in the data set.

In certain implementations, the method further comprises calculating a fold change value for each of LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17, and determining that each fold change value satisfies at least one criterion. The criterion may require that, for at least two independent population data sets, each respective calculated fold change value exceed a predetermined threshold.

In certain aspects, the systems and methods of the present disclosure provide an apparatus for assessing a sample obtained from a subject. The apparatus includes means for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17. The apparatus also includes means for correlating the expression levels with a classification of smoker status, and means for outputting the result of that classification as a prediction of the subject's smoker status.

In certain aspects, the systems and methods of the present disclosure provide a kit for predicting an individual's smoker status. The kit comprises: a set of reagents for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17; and instructions for using the kit to predict an individual's smoker status.

In certain aspects, the systems and methods of the present disclosure provide a kit for assessing the effect of an alternative to a smoking product on an individual. The kit comprises: a set of reagents for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17; and instructions for using the kit to assess the effect of the alternative on the individual. The alternative to a smoking product may be a heated tobacco product (HTP), and the effect of the alternative on the individual may be that the individual is classified as a non-smoker.

In certain aspects, the systems and methods of the present disclosure provide a method for assessing a sample obtained from a subject. The method includes receiving, by receiving circuitry, a data set associated with the sample. The data set comprises quantitative expression data for at least five markers selected from the group consisting of: LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3. The method further includes generating, by a processor, a score based on the received data set, the score indicating a predicted smoking status of the subject. The predicted smoking status may classify the subject as a current smoker or a non-current smoker.

The score may be the result of a classification scheme applied to the data set, wherein the classification scheme is determined based on the quantitative expression data in the data set. The method may further include calculating a fold change value for each of LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17, and determining that each fold change value satisfies at least one criterion. The criterion may require that, for at least two independent population data sets, each respective calculated fold change value exceed a predetermined threshold.

In certain aspects, the systems and methods of the present disclosure provide an apparatus for assessing a sample obtained from a subject. The apparatus includes means for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising at least five markers selected from the group consisting of: LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3. The apparatus further includes means for correlating the expression levels with a classification of smoker status, and means for outputting the result of that classification as a prediction of the subject's smoker status.

In certain aspects, the systems and methods of the present disclosure provide a kit for predicting an individual's smoker status. The kit comprises: a set of reagents for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising at least five markers selected from the group consisting of: LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3; and instructions for using the kit to predict an individual's smoker status.

In certain aspects, the systems and methods of the present disclosure provide a kit for assessing the effect of an alternative to a smoking product on an individual. The kit comprises: a set of reagents for detecting expression levels of genes in a gene signature in a test sample, the gene signature comprising at least five markers selected from the group consisting of: LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3; and instructions for using the kit to assess the effect of the alternative on the individual. The alternative to a smoking product may be a heated tobacco product (HTP), and the effect of the alternative on the individual may be that the individual is classified as a non-smoker.

Brief Description of the Drawings

Further features of the present disclosure, its nature, and its various advantages will become apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:

FIG. 1 is a flow chart of a method for identifying a set of genes and obtaining a classification model based on the set of genes.

FIG. 2 is a flow chart of a method for assessing a sample obtained from a subject.

FIG. 3 is a block diagram of an exemplary computing device that may be used to implement any component of any of the computerized systems described herein.

FIGS. 4A, 4B, and 4C are volcano plots of differentially expressed genes in the sample data sets.

FIGS. 5A, 5B, 5C, 5D, 5E, and 5F are box plots showing the classification results for different studies.

Detailed Description

Described herein are computational systems and methods for identifying a robust blood-based gene signature that can be used to predict an individual's smoker status. Specifically, the gene signature described herein can distinguish subjects who currently smoke from subjects who have never smoked or who have quit smoking.

As used herein, a “robust” gene signature is one that maintains strong performance across studies, laboratories, sample sources, and other demographic factors. Importantly, a robust signature should remain detectable even in population data that include large inter-individual variation. To avoid reporting an overly optimistic estimate of a signature's performance, its robustness should also be properly validated across data sets.

One goal of the present disclosure is to obtain a gene signature that can accurately predict an individual's smoker status. To evaluate the performance of a gene signature, prediction results are presented herein in a table that shows the predicted status in rows and the true status in columns. Table 1 below shows one example of how prediction results can be presented. The first row of the table gives the numbers of true current smokers and true non-current smokers whose samples were predicted to be associated with a current smoker, and the second row gives the numbers of true current smokers and true non-current smokers whose samples were predicted to be associated with a non-current smoker.

Table 1

A perfect predictor would correctly predict all current smokers as current smokers (true positives would be 100% and false negatives 0%) and all non-current smokers as non-current smokers (true negatives would be 100% and false positives 0%). As described herein, individuals are classified according to smoking status (e.g., current smoker, non-current smoker, former smoker, never smoker), but one of ordinary skill in the art will understand that the systems and methods described herein are applicable to any classification scheme.

To evaluate the strength of a predictor, various metrics based on the values in the prediction results table can be used. One metric, referred to herein as “sensitivity,” is the proportion of individuals in the current smoker group who are correctly classified as current smokers. In other words, sensitivity equals the number of true positives divided by the sum of true positives and false negatives, or TP/(TP+FN). A sensitivity of 1 represents perfect classification of current smokers. Another metric, referred to herein as “specificity,” is the proportion of individuals in the non-current smoker group who are correctly classified as non-current smokers. In other words, specificity equals the number of true negatives divided by the sum of true negatives and false positives, or TN/(TN+FP). A specificity of 1 represents perfect classification of non-current smokers. A strong predictor is expected to have both high sensitivity and high specificity. Although sensitivity and specificity are used herein to evaluate predictor performance, any other metric, such as the positive predictive value (TP/(TP+FP)) or the negative predictive value (TN/(TN+FN)), can be used without departing from the scope of this disclosure.
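The metrics defined above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the status labels ("current" and "non-current"), and the toy predictions are hypothetical and are not data from the studies described herein.

```python
def confusion_counts(true_status, predicted_status, positive="current"):
    """Count TP, FP, TN, FN, treating `positive` (current smoker) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_status, predicted_status):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def performance_metrics(tp, fp, tn, fn):
    """Metrics named in the text; assumes every denominator is non-zero."""
    return {
        "sensitivity": tp / (tp + fn),  # TP/(TP+FN)
        "specificity": tn / (tn + fp),  # TN/(TN+FP)
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical true and predicted statuses for eight subjects.
true_status = ["current", "current", "current", "non-current",
               "non-current", "non-current", "non-current", "current"]
predicted_status = ["current", "current", "non-current", "non-current",
                    "non-current", "current", "non-current", "current"]

counts = confusion_counts(true_status, predicted_status)
metrics = performance_metrics(*counts)
```

In this toy example one current smoker is missed and one non-current smoker is misclassified, so sensitivity and specificity are both 3/4.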

The systems and methods described herein construct a predictive model in the following steps. First, genes that exhibit high fold changes in expression level are identified from different training data sets. The identified set of genes is then validated using an independent data set. After validation, the set of genes is tested by evaluating the blood transcriptome of subjects whose smoker status is known and comparing the expression levels of the identified genes between individuals with one smoker status and individuals with another smoker status. The resulting set of successfully validated and tested genes is referred to herein as a “gene signature.”
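The first step, preselecting genes whose fold change exceeds a threshold in at least two independent data sets, can be sketched in Python as follows. This is a minimal illustration, not the procedure of the disclosure: the threshold of 1.0 on the absolute log2 fold change, the function names, and the expression values are all hypothetical.

```python
import math
from statistics import mean


def log2_fold_change(smoker_values, non_smoker_values):
    """Log2 ratio of mean linear-scale expression: current smokers vs. non-current smokers."""
    return math.log2(mean(smoker_values) / mean(non_smoker_values))


def passes_criterion(fold_changes, threshold=1.0, min_datasets=2):
    """True if |log2 fold change| exceeds `threshold` in at least `min_datasets` data sets."""
    hits = sum(1 for fc in fold_changes if abs(fc) > threshold)
    return hits >= min_datasets


# Hypothetical expression values for one gene in two independent data sets.
dataset_a = {"smokers": [8.0, 12.0, 10.0], "non_smokers": [2.0, 3.0, 2.5]}
dataset_b = {"smokers": [9.0, 11.0], "non_smokers": [4.0, 4.0]}

fold_changes = [
    log2_fold_change(d["smokers"], d["non_smokers"])
    for d in (dataset_a, dataset_b)
]
gene_passes = passes_criterion(fold_changes)
```

Requiring agreement across independent data sets, rather than a large effect in a single one, is what keeps the preselected genes from reflecting the peculiarities of one study.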

The gene signature can be used to accurately assign an individual to a particular predicted smoker status group. In addition, because it can accurately predict an individual's smoker status, the gene signature can detect the use of various HTPs by comparing results from individuals who use an HTP with results from individuals who smoke conventional cigarettes. The gene signature can be used wherever compliance with a smoking behavior needs to be verified. In one example, an individual's predicted smoker status (as determined by the gene signature) can be used in clinical trials of an HTP to identify whether, or when, biological changes occur in the individual after switching to the HTP. In general, the gene signature can be used in any health-related study that monitors smoking, smoking cessation, or switching to an HTP.

In one example, data profiling blood samples from current smokers and from non-smokers or former smokers were obtained from several publicly available gene expression data sets. Preselecting genes with high fold changes from each independent study is advantageous because it enhances the robustness of the signature across studies and ensures that the prediction model is not biased by a single data set. Validation was performed with an independent data set derived from a clinical study designed to explore novel biomarkers for COPD. In addition, in another clinical study, the blood transcriptome of smokers who switched from conventional cigarettes (which burn tobacco) to an HTP (which does not burn tobacco; referred to herein as Tobacco Heating System (THS) 2.1) for 5 consecutive days was evaluated and compared with that of smokers who continued to smoke conventional cigarettes. The signature described herein performs very well in classifying current smokers and non-current smokers, as demonstrated by its performance on the independent data set. Moreover, the effect of switching to THS 2.1 for 5 days was detectable in the blood transcriptome, because subjects who switched to THS 2.1 were classified as non-current smokers. This indicates that the gene signature and the systems and methods described herein can be used not only to determine smoker status but also to assess the short-term effects of smoking.

Using a signature based on a limited number of genes, rather than the entire transcriptome, is advantageous in terms of cost and workload, because the analysis will ultimately be based on quantitative reverse transcriptase-polymerase chain reaction (qRT-PCR) measurements. With qRT-PCR, the investment in equipment and the running costs (e.g., reagents) are lower than with microarrays.

In one example, in a first step, different training data sets are obtained to identify the gene signature. Specifically, two training data sets are used herein: BLD-SMK-01 and QASMC. In general, however, any number and any combination of training data sets can be used without departing from the scope of this disclosure.

For BLD-SMK-01, blood samples collected using the PAXgene blood DNA kit (Qiagen) were obtained from a repository (BioServe Biotechnologies Ltd, Beltsville, MD 20705, USA). At the time of sampling, subjects were between 23 and 65 years of age. Subjects with no history of disease and subjects taking prescription medication were excluded. Current smokers had smoked at least 10 cigarettes per day for at least 3 years. Former smokers had quit smoking at least 2 years before sampling and, before quitting, had smoked at least 10 cigarettes per day for at least 3 years. Current smokers and non-smokers were matched by age and sex. In total, 31 blood samples were obtained from current smokers, 30 from never smokers, and 30 from former smokers.

Blood samples were also obtained from the Queen Ann Street Medical Center (QASMC) clinical study, which was conducted at The Heart and Lung Centre in London, UK, in accordance with Good Clinical Practice (GCP) and is registered on ClinicalTrials.gov under identifier NCT01780298. The QASMC study was designed to identify a biomarker or a panel of biomarkers that can distinguish subjects with COPD (current smokers with a smoking history of ≥10 pack-years and GOLD stage 1 or 2) from matched subjects in three control groups (never smokers, former smokers, and current smokers). Samples were obtained from sixty subjects in each of the four groups (240 subjects in total). Male and female subjects aged between 40 and 70 years were included. All control subjects were matched to the COPD subjects recruited in the study by ethnicity, sex, and age (within 5 years). Blood samples were sent to AROS Applied Biotechnology AS (Aarhus, Denmark), where they were further processed and then hybridized to Affymetrix Human Genome U133 Plus 2.0 GeneChips, as described below.

Total RNA (including microRNA) was isolated using the PAXgene blood miRNA kit (Cat. No. 763134; Qiagen) according to the manufacturer's instructions. The concentration and purity of the RNA samples were determined by measuring absorbance at 230, 260, and 280 nm with a UV spectrophotometer (NanoDrop ND1000; Thermo Fisher Scientific, Waltham, MA, USA). RNA integrity was further checked using an Agilent 2100 Bioanalyzer. Only RNA with an RNA integrity number (RIN) greater than 6 was processed for further analysis.

RNA Preparation and Affymetrix Hybridization. Affymetrix probe sets targeting the 3' end of transcripts were prepared from 50 ng of RNA using the NuGEN™ Ovation™ Whole Blood Reagent and the NuGEN™ Ovation™ RNA Amplification System V2. The quantity of cDNA was measured using a NanoDrop 1000 or 8000 spectrophotometer (Thermo Fisher Scientific) or a SpectraMax 384 Plus (Molecular Devices). The quality of the cDNA was determined by assessing the size of the unfragmented cDNA using an Agilent 2100 Bioanalyzer. The size distribution of the final fragmented and biotinylated product was also monitored using electropherograms. After the cDNA was labeled, the fragments were hybridized to GeneChip Human Genome U133 Plus 2.0 arrays according to the manufacturer's guidelines. The samples used for target preparation were fully randomized across the Affymetrix gene expression microarrays.

TaqMan qRT-PCR analysis. Reverse transcription was performed with 500 ng of starting RNA using the iScript™ cDNA Synthesis Kit (Cat. No. 170-8890; Bio-Rad, Hercules, CA, USA) according to the manufacturer's instructions. The cDNA was then accurately diluted to 10 ng/μL. A commercial universal human reference (UHR) RNA (Cat. No. 740000; Agilent Technologies, Santa Clara, CA, USA) was added to the sample set as a calibrator so that data from multiple experiments and instruments could be compared reliably. The probes used in the TaqMan assays spanned exons, and five housekeeping genes (B2M, GAPDH, FARP1, A4GALT, GINS2) were selected for the data normalization step. The qPCR step was performed using TaqMan assays and Fast Advanced Master Mix (Cat. No. 444963). Briefly, the cDNA was diluted so that 1.25 ng could be loaded per well in a 384-well plate. In parallel, a master mix (containing the TaqMan assay reagents and the TaqMan advanced mix) was prepared for each TaqMan assay. The final reaction volume was 10 μL. qPCR was run on a ViiA 7 instrument (Life Technologies), and automatic baseline and default Ct threshold settings were applied to analyze the results. Where a UHR sample was included, the Ct value for each gene was normalized (by subtraction) to the UHR Ct value and then normalized to the value for the GAPDH housekeeping gene (yielding the so-called ΔΔCt value).
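The two-step normalization described above can be sketched in Python as follows. The sketch rests on assumptions that go beyond the text: it takes the GAPDH value to be UHR-normalized in the same way as the target gene, the Ct values are invented for illustration, and the final conversion to relative expression as 2^(-ΔΔCt) is the common convention that assumes roughly 100% amplification efficiency.

```python
def delta_delta_ct(sample_ct, uhr_ct, gene, reference="GAPDH"):
    """Compute (Ct_gene - Ct_gene,UHR) - (Ct_ref - Ct_ref,UHR).

    sample_ct and uhr_ct map gene symbols to Ct values measured in the
    test sample and in the UHR calibrator, respectively.
    """
    gene_vs_uhr = sample_ct[gene] - uhr_ct[gene]                 # step 1: subtract UHR Ct
    reference_vs_uhr = sample_ct[reference] - uhr_ct[reference]  # same step for GAPDH
    return gene_vs_uhr - reference_vs_uhr                        # step 2: normalize to GAPDH


# Hypothetical Ct values (not measurements from the study).
sample_ct = {"LRRN3": 24.0, "GAPDH": 18.0}
uhr_ct = {"LRRN3": 27.0, "GAPDH": 19.0}

ddct = delta_delta_ct(sample_ct, uhr_ct, "LRRN3")

# Assumed convention: expression relative to the calibrator is 2^(-ddCt).
relative_expression = 2 ** (-ddct)
```

Because the calculation uses only subtraction, normalizing to the UHR first and to GAPDH second gives the same ΔΔCt as the reverse order.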

TaqMan primers were obtained from Life Technologies (CA, USA). Table 2 below lists the primer sequences used to perform qRT-PCR.

Table 2

Microarray Analysis - Data Quality Check and Normalization. After the chip images were inspected for scanning artifacts, the data were processed through a standard quality control pipeline. Briefly, the raw data files were read using the ReadAffy function of the affy package (Gautier, L., Cope, L., Bolstad, B.M., and Irizarry, R.A. (2004). affy---analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307-315) from the Bioconductor suite of microarray analysis tools (Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80), which is available for the R statistical environment (R Development Core Team (2007). R: A Language and Environment for Statistical Computing). Quality was controlled by generating and examining the following: RNA degradation plots (AffyRNAdeg function of the affy package), normalized unscaled standard error (NUSE) plots, relative log expression (RLE) plots (affyPLM package; Brettschneider, J., Collins, F., and Bolstad, B.M. (2008). Quality Assessment for Short Oligonucleotide Microarray Data. Technometrics 50, 241-264), and the means of the relative log expression values. In addition, pseudo-images (residuals of the probe-level model) were inspected visually to ensure the absence of spatial effects. Arrays that fell below a set of thresholds in the quality control checks were excluded from further analysis.

For the population-level analysis (i.e., the mean fold-change study), the data were subsequently normalized using GC-Robust Microarray Analysis (GC-RMA). Microarray expression values were generated from all arrays that passed the quality control checks, using background correction and quantile normalization (Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. and Speed, T.P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264). For the individual signature prediction models, the data were normalized with MAS5 (Affymetrix, I. (2002). Statistical algorithms description document. Technical paper).
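The quantile normalization step mentioned above can be illustrated with a minimal sketch. It is not the GC-RMA implementation used in the study (which was run in R/Bioconductor); the function name and the toy values are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equally long expression vectors, one vector per array.

    Illustrative only: ties are broken by position, not averaged.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at every rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace each value by the mean at its rank
        normalized.append(out)
    return normalized


# Hypothetical toy intensities for two arrays and three probes.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step every array shares the same set of values, so differences between arrays reflect only the rank of each probe.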

Statistical Modeling - Population-Level Analysis. For each comparison, an overall linear model was fitted, and raw p-values were generated for each probe set on the expression array based on moderated t-statistics. The Benjamini-Hochberg False Discovery Rate (FDR) method was used to correct for the multiple-testing effect that arises from evaluating a large number of genes.
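As an illustration of the multiple-testing correction, the following sketch computes Benjamini-Hochberg adjusted p-values in plain Python. The function name and the example p-values are hypothetical; the study itself would have used the corresponding R routines.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values are monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four probe sets.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```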

Statistical Modeling - Individual Sample Predictive Modeling. To achieve robustness in the predictive models, independent gene expression datasets from blood (GSE15289) and PBMC (GSE42057) were obtained from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/gds/?term=GEO) and processed. The dataset from the NOWAC study (GSE15289) (Dumeaux, V., Olsen, K.S., Nuel, G., Paulssen, R.H., A.-L. and Lund, E. (2010a). Deciphering normal blood gene expression variation - The NOWAC postgenome study. PLoS Genetics 6, e1000873) comprises whole blood samples from 285 postmenopausal women aged between 48 and 63 years, including 211 never smokers and 74 current smokers. The dataset of Bahr et al. (GSE42057) (Bahr, T.M., Hughes, G.J., Armstrong, M., Reisdorph, R., Coldren, C.D., Edwards, M.G., Schnell, C., Kedl, R., LaFlamme, D.J. and Reisdorph, N. (2013). Peripheral Blood Mononuclear Cell Gene Expression in Chronic Obstructive Pulmonary Disease. American Journal of Respiratory Cell and Molecular Biology) was derived from peripheral blood mononuclear cell (PBMC) samples collected from 36 current smokers (22 with COPD and 14 healthy) and 100 former smokers (72 with COPD and 28 healthy). All subjects were non-Hispanic white.

Subject data drawn from the GSE15289 and GSE42057 datasets were used to identify, in each dataset, genes that show a high mean change in expression between smoker samples and never-smoker (or former-smoker) samples. Let L₁ and L₂ be the sets of the M genes with the highest fold change in the two independent datasets GSE15289 and GSE42057, respectively (here, M = 1000, but in general M can be any value). To obtain list L₁, the samples of dataset GSE15289 were grouped by smoking status (current smoker or never smoker), and the mean gene expression level of each group was obtained. The difference in mean gene expression level between the current-smoker group and the never-smoker group is referred to herein as the fold change, and the M genes with the highest fold change are included in set L₁. List L₂ was obtained in a similar manner from GSE42057, but for current smokers and former smokers.
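The construction of a ranked list such as L₁ can be sketched as follows. This is only an illustration: the function name, the gene identifiers and the expression values are hypothetical, the values are assumed to be on a log scale so that a difference of means plays the role of the fold change, and ranking by absolute value is an assumption, since the text does not state how up- and down-regulated genes are ordered.

```python
def top_fold_change_genes(expression, status, group_a, group_b, m):
    """Return the m genes with the largest absolute difference of group means.

    expression: dict mapping gene -> list of (log-scale) values, one per sample
    status:     list of smoking-status labels, one per sample
    """
    def mean(values):
        return sum(values) / len(values)

    fold_change = {}
    for gene, values in expression.items():
        a = [v for v, s in zip(values, status) if s == group_a]
        b = [v for v, s in zip(values, status) if s == group_b]
        fold_change[gene] = mean(a) - mean(b)
    # Assumption: rank by magnitude, regardless of direction.
    ranked = sorted(fold_change, key=lambda g: abs(fold_change[g]), reverse=True)
    return ranked[:m]


# Hypothetical toy data: three genes, two current and two never smokers.
status = ["current", "current", "never", "never"]
expression = {
    "GENE_A": [8.0, 8.2, 6.0, 6.2],
    "GENE_B": [5.0, 5.1, 5.0, 5.1],
    "GENE_C": [3.0, 3.2, 4.0, 4.4],
}
top_two = top_fold_change_genes(expression, status, "current", "never", m=2)
```

Because the returned list is ordered, its first N entries correspond to the prefix written L₁[1:N] in the text.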

Figure 1 is a flow chart of a method 100 for identifying a set of genes and obtaining a classification model based on that set. Specifically, method 100 includes the steps of initializing a counter parameter N to 1 (step 102), evaluating the performance of a linear discriminant analysis (LDA) model by computing a Matthews Correlation Coefficient MCC(N) (step 104), and determining whether the counter parameter equals a maximum counter value M (decision block 106). If N is less than M, method 100 proceeds to step 108 to increment N and returns to step 104 to evaluate the performance of the LDA model by computing the next coefficient MCC(N). When N reaches M (decision block 106), the value of N that yields the maximum MCC value (N_max) is evaluated (step 110), and a core gene list is defined as the intersection between the two gene sets L₁[1:N] and L₂[1:N] at N = N_max (step 112). After the core gene list has been identified, an LDA model is computed based on the core gene list (step 114).

At step 102, the counter parameter N is initialized to 1. The counter parameter N runs from 1 to the maximum value M and is incremented at step 108 until N reaches M at decision block 106.

At step 104, the performance of the LDA model is evaluated by computing the coefficient MCC(N). Specifically, the performance of the LDA model can be evaluated using 5-fold cross-validation (repeated 100 times) on L₁[1:N] ∩ L₂[1:N], the intersection of the N highest fold-change genes in set L₁ and the N highest fold-change genes in set L₂. The MCC metric combines all of the true/false positive and negative rates and therefore provides a fair single-value measure. MCC is a performance metric that can be used as a composite performance score. It takes a value between -1 and +1 and is essentially a correlation coefficient between the known and the predicted binary classifications. MCC can be calculated using the following equation:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. In general, however, any suitable technique that produces a composite performance metric from a set of performance metrics can be used to assess the performance of the LDA model. An MCC value of +1 indicates that the model makes perfect predictions, an MCC value of 0 indicates that the model predictions are close to random, and an MCC value of -1 indicates that the model predictions are completely inaccurate. An advantage of MCC is that it is easy to compute when the classifier function is coded in such a way that only class predictions are available. In contrast, the area under the curve (AUC) calculation requires the classifier function to provide a numerical score. In general, however, any metric that accounts for TP, FP, TN, and FN can be used in accordance with the present disclosure.
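The definitions of TP, FP, TN and FN above translate directly into code. The sketch below uses the standard definition of the Matthews Correlation Coefficient together with sensitivity and specificity; the function names are illustrative, and returning 0 when the denominator vanishes is a common convention rather than something stated in the text.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention (assumption) when a row or column is empty
    return (tp * tn - fp * fn) / denominator


def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive (Se)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative (Sp)."""
    return tn / (tn + fp)


perfect = matthews_corrcoef(tp=10, fp=0, tn=10, fn=0)
random_like = matthews_corrcoef(tp=5, fp=5, tn=5, fn=5)
inverted = matthews_corrcoef(tp=0, fp=10, tn=0, fn=10)
```

The three example calls reproduce the interpretation given in the text: +1 for perfect prediction, 0 for prediction no better than chance, and -1 for completely inverted prediction.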

To calculate MCC, a set of classification categories must first be selected. The BLD-SMK-01 dataset was obtained from never smokers, former smokers, and current smokers. Figures 4A, 4B, and 4C show volcano plots of the differentially expressed genes in the BLD-SMK-01 samples. Each volcano plot shows the estimated log2(fold change) versus log10(adjusted P value). The P values were calculated based on moderated t-statistics and adjusted by the Benjamini-Hochberg method. Specifically, Figure 4A compares the gene expression profiles of current smokers and never smokers, Figure 4B compares those of current smokers and former smokers, and Figure 4C compares those of former smokers and never smokers. The volcano plot in Figure 4C indicates no differential gene expression between never smokers and former smokers (i.e., no trend is observed in Figure 4C), whereas Figures 4A and 4B indicate that many differential gene expression changes were observed between current smokers and never smokers (Figure 4A) and between current smokers and former smokers (Figure 4B).

Thus, the population-level transcriptomic analysis of the BLD-SMK-01 samples shows that there are no differential gene expression changes in whole blood between never smokers and former smokers, and that distinguishing former smokers from never smokers on the basis of the blood transcriptome would therefore be very challenging. Conversely, many genes are differentially expressed between current smokers and never smokers and between current smokers and former smokers (Figures 4A and 4B). Because no difference was observed between the never-smoker and former-smoker populations, only two classes are used to evaluate the model at step 104: current smokers and non-current smokers.

Specifically, at step 104, the gene set L₁[1:N] ∩ L₂[1:N] corresponds to the intersection of the N highest fold-change genes of the two independent datasets GSE15289 and GSE42057. Each prediction model based on L₁[1:N] or L₂[1:N] is cross-validated to assess whether the results of the LDA model generalize to independent data. In one example, to perform one 5-fold cross-validation on the L₁[1:N] gene set, the data for the L₁[1:N] set are randomly divided into five subsets: A, B, C, D, and E. Four subsets (A, B, C, and D) are used to train a classifier using the LDA technique, and the fifth subset (E) is used to test the classifier trained on the other four. The training and testing process is repeated four more times, with each of the other subsets (A, B, C, and D) in turn used as the test subset for a classifier trained on the remaining four subsets.
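The repeated 5-fold partitioning described above can be sketched as a generator of train/test index lists. This is a minimal illustration under stated assumptions: the function name, the seed and the sample count are hypothetical, the split is applied to sample indices, and no stratification by smoking status is attempted, since the text does not say whether the folds were stratified.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=100, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        folds = [indices[i::k] for i in range(k)]  # subsets A, B, C, D, E when k = 5
        for test_fold in folds:
            held_out = set(test_fold)
            train = [i for i in indices if i not in held_out]
            yield train, test_fold


# Two repeats of 5-fold cross-validation over 20 hypothetical samples.
splits = list(repeated_kfold(n_samples=20, k=5, repeats=2))
```

Each repeat yields five train/test pairs, so 100 repeats of 5-fold cross-validation produce 500 trained classifiers whose predictions can be pooled into a single MCC(N).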

In general, the LDA technique classifies an input vector x of n features into a class y. The classification is based on a function that is a linear combination of the observed features. The coefficients of the linear combination are estimated from the training subsets of the data. Specifically, to train a classifier using the LDA technique, a linear combination of gene expression levels is identified from the data in the four training subsets. This linear combination is referred to herein as the classifier, and it defines a boundary between predicted smoker status and predicted non-smoker status. The classifier is used to obtain the predicted status of each individual in the test subset. This process is repeated four more times so that each of the five subsets serves as the test subset once. After each of the five subsets has served as the test subset once, one 5-fold cross-validation is complete, and the training data observations (with the features in the L₁[1:N] ∩ L₂[1:N] set) are divided into five new subsets A', B', C', D', and E', which starts a second 5-fold cross-validation.
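A two-class linear discriminant of the kind described above can be written out in a few lines. The sketch below is not the implementation used in the study; it assumes equal class priors and a pooled within-class covariance matrix, and the function names, the sign convention (positive score means class 1) and the toy data are all hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_lda(samples, labels):
    """Fit a two-class LDA; returns (weights, offset) of the linear score."""
    d = len(samples[0])
    groups = {c: [x for x, y in zip(samples, labels) if y == c] for c in (0, 1)}
    means = {c: [sum(x[j] for x in rows) / len(rows) for j in range(d)]
             for c, rows in groups.items()}
    # Pooled within-class covariance matrix.
    cov = [[0.0] * d for _ in range(d)]
    for c, rows in groups.items():
        for x in rows:
            for i in range(d):
                for j in range(d):
                    cov[i][j] += (x[i] - means[c][i]) * (x[j] - means[c][j])
    dof = len(samples) - 2
    cov = [[v / dof for v in row] for row in cov]
    # Weights are covariance^-1 * (mean_1 - mean_0); the boundary passes
    # through the midpoint of the class means (equal priors assumed).
    weights = solve(cov, [means[1][j] - means[0][j] for j in range(d)])
    offset = sum(w * (means[0][j] + means[1][j]) / 2 for j, w in enumerate(weights))
    return weights, offset


def predict(weights, offset, x):
    """Return 1 if the linear score is positive, else 0."""
    score = sum(w * v for w, v in zip(weights, x)) - offset
    return 1 if score > 0 else 0


# Hypothetical expression values of two genes in eight training samples.
samples = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.5],
           [5.0, 6.0], [6.0, 5.0], [5.5, 5.5], [6.0, 6.5]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
weights, offset = fit_lda(samples, labels)
```

In a cross-validation run, `fit_lda` would be called on the four training subsets and `predict` on each sample of the held-out subset.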

The examples described herein are the results of 100 repetitions of 5-fold cross-validation, but in general, one of ordinary skill in the art will understand that any number of repetitions of k-fold cross-validation can be used without departing from the scope of the present disclosure. In addition, the examples described herein are the results of the LDA technique, which forms a classifier from a linear combination of gene expression levels. In general, however, one of ordinary skill in the art will understand that any function of the gene expression levels can be used to form the classifier, such as a quadratic function, a polynomial function, an exponential function, or any other suitable function that can form a one-dimensional manifold in R^N to define the classifier.

At step 110, after N has reached the maximum number M, the set of M values of MCC is considered, and the value of N corresponding to the maximum MCC is evaluated as N_max = argmax_N(MCC(N)). As shown in Figure 1, the evaluation of N_max is performed after all M values of MCC have been computed. In general, however, one of ordinary skill in the art will understand that the value MCC(N) computed at step 104 may alternatively be compared with a predetermined threshold before the next value MCC(N+1) is evaluated. In that case, when a value of MCC is found to exceed the predetermined threshold, method 100 may proceed directly to step 110 and assign the current value of N to N_max, without considering the remaining values N = N_max + 1 to M.
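The selection of N_max, including the optional early exit on a threshold, can be sketched as a single loop. The function name and the MCC values are hypothetical; `mcc_of_n` stands for whatever routine returns the cross-validated MCC(N) of step 104.

```python
def select_n_max(mcc_of_n, m, threshold=None):
    """Return N_max for N = 1..m.

    mcc_of_n:  callable returning the cross-validated MCC for a given N
    threshold: if given, stop at the first N whose MCC exceeds it
    """
    best_n, best_mcc = None, float("-inf")
    for n in range(1, m + 1):
        score = mcc_of_n(n)
        if threshold is not None and score > threshold:
            return n  # early exit: the remaining values of N are not evaluated
        if score > best_mcc:
            best_n, best_mcc = n, score
    return best_n  # argmax over all m values (first maximum on ties)


# Hypothetical MCC values for N = 1..4.
toy_scores = {1: 0.1, 2: 0.4, 3: 0.7, 4: 0.6}
n_max_full = select_n_max(toy_scores.get, m=4)
n_max_early = select_n_max(toy_scores.get, m=4, threshold=0.3)
```

The early exit trades a possibly lower MCC for fewer cross-validation runs, which matters when M is as large as 1000.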

At step 112, the core gene list of the signature is defined by the intersection L₁[1:N_max] ∩ L₂[1:N_max], that is, the set of genes that are in both L₁[1:N_max] and L₂[1:N_max]. In the example described here, only two datasets, giving L₁ and L₂, are used. In general, however, one of ordinary skill in the art will understand that any number of datasets can be used to compute the MCC values and to identify the core gene set that defines the gene signature. Specifically, the intersection of m datasets or the union of pairwise intersections can be used.
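Both ways of combining ranked lists mentioned above, the intersection over all datasets and the union of pairwise intersections, reduce to set operations. The following sketch uses hypothetical function names and placeholder gene identifiers.

```python
from itertools import combinations


def core_genes(ranked_lists, n_max):
    """Genes present in the top n_max entries of every ranked list."""
    tops = [set(ranked[:n_max]) for ranked in ranked_lists]
    return set.intersection(*tops)


def union_of_pairwise_intersections(ranked_lists, n_max):
    """Genes present in the top n_max entries of at least two ranked lists."""
    tops = [set(ranked[:n_max]) for ranked in ranked_lists]
    result = set()
    for first, second in combinations(tops, 2):
        result |= first & second
    return result


# Placeholder gene identifiers for three hypothetical ranked lists.
l1 = ["g1", "g2", "g3", "g4"]
l2 = ["g3", "g1", "g5", "g2"]
l3 = ["g1", "g6", "g5", "g7"]
two_way = core_genes([l1, l2], n_max=3)
three_way = core_genes([l1, l2, l3], n_max=3)
pairwise = union_of_pairwise_intersections([l1, l2, l3], n_max=3)
```

With more datasets, the full intersection becomes stricter, while the union of pairwise intersections keeps any gene supported by at least two datasets.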

At step 114, an LDA model is computed using the core gene list determined at step 112. Specifically, the LDA model based on the core gene list can be computed by performing 100 repetitions of 5-fold cross-validation, or any number of repetitions of k-fold cross-validation.

In one example, applying the statistical modeling method described in steps 102 to 114 identified a core gene signature comprising the following six genes: LRRN3, SASH1, PALLD, RGL1, TNFRSF17, and CDKN1C. For classifying samples obtained from current smokers and never smokers, the 5-fold cross-validation (100 times) MCC of this model was 0.77, with a sensitivity score (Se) of 0.91 and a specificity score (Sp) of 0.85. By design of the method, the core genes in the signature are among the high fold-change genes in both the NOWAC (GSE15289) and Bahr et al. (GSE42057) studies, and the signature improved on the performance of an LDA model whose predictions were based on all 77 genes common to these two GSE studies (Se = 0.73, Sp = 0.81). Although the six genes LRRN3, SASH1, PALLD, RGL1, TNFRSF17, and CDKN1C together are referred to herein as the core gene signature, one of ordinary skill in the art will understand that any combination of the six genes can be used as a core gene signature, such as any combination of three, four, or five of the six genes.

In some embodiments, the genes in the signature are augmented to form an extended gene set that includes other genes that are not in the core set but are associated with high specificity and sensitivity scores. Specifically, when the prediction models obtained from each list of high fold-change genes individually were examined, IGJ, RRM2, ID3, SERPING1, and FUCA1 were repeatedly identified as potential candidates for a signature with high specificity and sensitivity. These five genes are also among the high fold-change genes in the blood transcriptomes of the NOWAC (current smokers versus never smokers) and Bahr et al. (current smokers versus former smokers) studies, and they were used to expand the core gene signature into an extended signature. For classifying current smokers and never smokers, the cross-validated MCC of the model based on the extended signature (LRRN3, SASH1, PALLD, RGL1, TNFRSF17, CDKN1C, IGJ, RRM2, ID3, SERPING1, and FUCA1) was 0.73 (Se = 0.88, Sp = 0.84). Although the eleven genes LRRN3, SASH1, PALLD, RGL1, TNFRSF17, CDKN1C, IGJ, RRM2, ID3, SERPING1, and FUCA1 together are referred to herein as the extended gene signature, one of ordinary skill in the art will understand that any combination of the eleven genes can be used as a gene signature, such as any combination of five, six, seven, eight, nine, or ten of the eleven genes. In addition, a combination can include three, four, or five of the six genes in the core gene signature together with two, three, or four of the five additional genes in the extended gene signature.

The results of the LDA model computed at step 114 were compared with the cross-validated prediction results of a model obtained by learning a sparse signature from BLD-SMK-01 alone (that is, without using the two public datasets GSE15289 and GSE42057). For predicting smokers versus non-smokers, the 5-fold cross-validation performance of that model was Sp = 0.96 and Se = 0.93, slightly higher than the performance of the models based on the core and extended signatures. Although the cross-validated specificity and sensitivity of the prediction model derived with the method described herein (Sp = 0.88, Se = 0.84) are slightly lower than those of the model obtained without independent datasets (Sp = 0.96, Se = 0.93), the prediction model derived herein is advantageous because it is relevant to a wider range of applications. Specifically, the prediction model derived according to the method of the present disclosure is robust, as demonstrated when the model is validated, as described in detail with respect to step 116.

At step 116, the LDA model computed at step 114 is validated. Validation of the LDA model was performed using the former-smoker group from the BLD-SMK-01 study and the blood dataset from the QASMC study. After quality checks on the QASMC transcriptomics samples, CEL files for 52 COPD subjects, 58 current smokers, 58 former smokers, and 59 never smokers were available for prediction. To evaluate the predictive performance of the core and extended signatures, the QASMC samples were divided into two groups: current smokers (COPD and healthy) and non-current smokers (including former smokers and never smokers). These two groups allow the robustness of the signatures with respect to COPD status to be assessed. Each centered dataset was predicted using the model built for the core gene signature or the extended signature.

Table 3 shows the prediction results obtained with the LDA model on independent datasets for the various signatures. The format of Table 3 follows that of Table 1, with the predicted classes shown in rows and the actual classes in columns. Specifically, the prediction results shown in Table 3 include those for the core gene signature (first three rows), the extended gene signature (middle three rows), a signature derived from the BLD-SMK-01 samples alone (second-to-last row), and a signature based on the gene set described in Beineke et al. (Beineke, P., Fitch, K., Tao, H., Elashoff, M.R., Rosenberg, S., Kraus, W.E. and Wingrove, J.A. (2012). A whole blood gene expression-based signature for smoking status. BMC Medical Genomics 5, 58) (last row). As shown in Table 3, both the core and extended signatures yield higher sensitivity and specificity scores than the signature derived from the BLD-SMK-01 samples alone and the signature identified by Beineke.

Table 3

The classification performance of the signatures on the QASMC study confirms that the models are robust regardless of COPD status (Se = 0.9, Sp = 0.9 for the core signature; Se = 0.91, Sp = 0.90 for the extended signature).

In addition, Figures 5A, 5B, 5D, and 5E show box plots that indicate the classification schemes for the different studies. Specifically, Figures 5A and 5B show box plots of the posterior probability, from the LDA model, that a sample is classified as a current smoker, for the BLD-SMK-01 study and the QASMC study, respectively. Figures 5D and 5E show box plots of the prediction scores of the linear discriminant function for the BLD-SMK-01 study and the QASMC study, respectively. Specifically, samples with negative scores are classified as current smokers, and samples with positive scores are classified as non-current smokers.

The influence of other covariates, such as sex and age, was also examined. The BLD-SMK-01 and QASMC studies were balanced with respect to sex and age. There was no statistical association between age or sex and smoking status, as indicated by chi-square tests (for BLD-SMK-01, χ²(sex, smoking status) P value = 1; for QASMC, χ²(sex, smoking status) P value = 0.9) and t-tests (for BLD-SMK-01, t-test (age vs. smoking status) P value = 0.8; for QASMC, t-test (age vs. smoking status) P value = 0.46).
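A check of this kind can be illustrated with a chi-square test of independence. The sketch below is restricted to a 2x2 table (for example, sex by current versus non-current smoker), which is a simplifying assumption because the studies include more than two smoking categories; no continuity correction is applied, and the function name and counts are hypothetical rather than study data.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 contingency table (no continuity correction).

    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows are sexes, columns are smoking-status groups.
balanced = chi_square_2x2([[10, 10], [10, 10]])
skewed = chi_square_2x2([[20, 10], [10, 20]])
```

A perfectly balanced design gives a statistic of 0 and a P value of 1, which is the situation reported for BLD-SMK-01.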

In addition, in BLD-SMK-01, each gene in the signatures was tested for association with sex and age, and none had an ANOVA P value below 0.05, although the PALLD gene showed a weak sex effect. Previously identified gene signatures were found to be affected by sex and/or age, and adjustment for such factors was determined to be necessary (Beineke et al., 2012). Specifically, age was a significant covariate in the two public datasets (GSE15289 and GSE42057), because the smokers were on average older than the never or former smokers; this covariate was nevertheless not included in the predictor because it was not statistically associated with smoking status in the BLD-SMK-01 study. In addition to performing better, as defined by the specificity and sensitivity scores, the gene signatures described herein are generally independent of sex and age. This shows that the core and extended signatures described herein offer an advantage over known gene signatures in that no adjustment for these factors is needed, which simplifies the computation.

To determine whether the signatures found could be translated into a qRT-PCR-based biomarker of exposure, a subset of twenty randomly selected samples (ten current smokers and ten never smokers) was subjected to qRT-PCR to measure the expression levels of the genes in the extended signature. An LDA model based on the genes of the extended signature was trained on the normalized qRT-PCR data and assessed by 10-fold cross-validation (1000 times; 10 folds were chosen because of the small sample size), yielding a specificity of 0.85 and a sensitivity of 0.96 (Table 4). When the same procedure was applied to the core signature, a specificity of 0.62 and a lower sensitivity of 0.80 were obtained (Table 4).

Table 4

One goal of the present disclosure is to apply the core and extended gene signatures to determine whether they can be used to detect the effects of switching to a heated tobacco product (HTP). To that end, data were obtained from the REX-EX-01 study. The REX-EX-01 study was an open-label, randomized, controlled, two-arm, parallel-group study that enrolled 42 healthy smokers of both sexes aged between 23 and 65 years. The study was conducted to compare smokers of conventional cigarettes with smokers who had switched to an HTP (referred to herein as Tobacco Heating System 2.1 (THS 2.1)) for 5 consecutive days. The study was conducted according to Good Clinical Practices (GCP) and is registered on ClinicalTrials.gov under identifier NCT01780714. Blood samples were stored in PAXgene tubes and shipped to AROS Applied Biotechnology in Aarhus, Denmark, where they were further processed and then hybridized to Affymetrix Human Genome U133 Plus 2.0 GeneChips.

To test whether the gene signatures identified herein provide a sensitive and non-invasive tool for assessing exposure response in clinical trials, the signatures were applied to the THS 2.1 data to determine whether switching to an HTP can be detected in the whole blood transcriptome after five days. The hypothesis of this study was that the whole blood transcriptome of smokers who switched to THS 2.1 resembles that of former smokers more than that of current smokers. Rather than characterizing a gene expression profile specific to HTP users five days after switching (for example, by extracting a signature from the REX-EX-01 study data), the aim was to identify a transcriptome-based exposure-response signature that could also serve as an indicator of longer-term switching patterns. This was achieved by establishing the core and extended gene signatures, both of which are able to distinguish current-smoker samples from non-current-smoker samples.

After quality checks were performed on the CEL files of the REX-EX-01 study, 16 files remained for conventional cigarette smokers and 18 files remained for THS 2.1 users at day 5. Table 5 below shows the prediction results for the REX-EX-01 samples using the core gene signature (first three rows) and the extended gene signature (last three rows). With the extended gene signature, individuals who continued to smoke conventional cigarettes (current smokers) were mainly classified as current smokers (69%), while subjects who switched to THS 2.1 were mainly classified as non-current smokers (89%). With the core gene signature, the proportion of correctly classified current smokers was the same (69%), and 78% of subjects who switched to THS 2.1 were classified as non-current smokers. Both the core gene signature and the extended gene signature therefore predict that samples obtained from HTP users are non-current smoker samples.

Table 5

The results shown in Table 5 are consistent with the initial hypothesis that the blood transcriptome of subjects who switch to THS begins to resemble that of former smokers rather than that of current smokers, even though nicotine and cotinine exposure did not differ significantly between THS 2.1 and conventional cigarettes (data not shown).

In addition, FIG. 5C shows box plots of the posterior probability, from the LDA model, that a sample is classified as a current smoker, based on the REX-EX-01 data, and FIG. 5F shows box plots of the prediction scores from the linear discriminant function, based on the same data. Samples with negative prediction scores are classified as current smokers, while positive prediction scores indicate non-current smoker status.
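The sign convention described above can be illustrated with a minimal Python sketch. The function names are hypothetical. The conversion from score to posterior probability assumes that the prediction score equals the log-odds of non-current versus current smoker status, which holds for a standard two-class LDA with a shared covariance matrix but is not stated in this disclosure. How a score of exactly zero is handled is also an assumption.

```python
import math


def classify_from_score(score: float) -> str:
    """Apply the sign rule: negative score -> current smoker,
    positive score -> non-current smoker.

    Assumption: a score of exactly 0 is assigned to the non-current
    class; the disclosure does not specify this edge case.
    """
    return "current smoker" if score < 0 else "non-current smoker"


def posterior_current_smoker(score: float) -> float:
    """Posterior probability of the current-smoker class.

    Assumption: `score` is the log-odds of non-current vs. current
    smoker status, so P(current) = 1 / (1 + exp(score)).
    """
    return 1.0 / (1.0 + math.exp(score))


# Hypothetical scores, for illustration only.
for s in (-2.0, 0.5):
    print(s, classify_from_score(s), round(posterior_current_smoker(s), 3))
```

Under this assumption, a more negative score corresponds to a higher posterior probability of being a current smoker, which is consistent with the two box plots conveying the same ranking of samples.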

Compared with gene signatures that rely on the measurement of a single gene, gene expression profiling provides a comprehensive and more complete view of biological processes under normal and pathological conditions. When the expression trends of multiple genes are combined, it is also possible to derive a signature or classifier for a specified physiological state, from exposure response to disease state. Although the primarily affected tissue can provide samples that more accurately represent the normal, exposed, or pathological state, it is often impractical to classify subjects using tissue biopsies. Because blood is easy to collect using minimally invasive techniques, blood-based signatures hold great promise for biomarker discovery. In this study, two sets of whole-blood-based biomarkers were identified, either of which can serve as a signature of the body's response to smoking and can therefore be used as a strong predictor of an individual's smoking status.

A gene that stood out strongly in this study was LRRN3. Expression of LRRN3 was higher in current smokers than in non-current smokers. In the REX-EX-01 study, LRRN3 expression decreased significantly between day 0 and day 5 in the blood of subjects who switched to the HTP, and remained constant in the blood of subjects who continued to smoke conventional cigarettes. LRRN3 therefore appears to be an important gene in both the core and extended signatures for measuring the effects of switching from conventional cigarettes to an HTP. In one example, a gene signature as described includes only LRRN3 and no other genes, or includes LRRN3 together with any other genes. Specifically, a gene signature that includes LRRN3 can detect a switch from smoking conventional cigarettes to using an HTP by showing a decrease in LRRN3 expression between day 0 and day 5 after the switch.
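A decrease in expression between two time points is commonly summarized as a log2 fold change. The following Python sketch shows that arithmetic for a single gene such as LRRN3. The function names, the intensity values, and the threshold of 0.5 are hypothetical and are not taken from the REX-EX-01 data; the inputs are assumed to be linear-scale (not log-transformed) expression values.

```python
import math


def log2_fold_change(day0: float, day5: float) -> float:
    """log2 of the ratio day5 / day0 for linear-scale expression values.

    Negative values indicate lower expression at day 5 than at day 0.
    """
    return math.log2(day5 / day0)


def expression_decreased(day0: float, day5: float,
                         threshold: float = 0.5) -> bool:
    """True if expression fell by more than `threshold` on the log2 scale.

    The default threshold is an illustrative assumption only.
    """
    return log2_fold_change(day0, day5) < -threshold


# Hypothetical intensities: expression halves after switching,
# and barely changes when smoking continues.
print(log2_fold_change(200.0, 100.0))
print(expression_decreased(200.0, 100.0))
print(expression_decreased(200.0, 190.0))
```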

The systems pharmacology approach described herein allows the construction of one or more robust whole-blood-based smoker gene signatures that can distinguish current smokers from non-current smokers. The core gene signature described herein is based on six genes, and the extended gene signature is based on the core gene signature plus five additional genes. Both gene signatures predict an individual's smoker status with remarkable accuracy, as assessed by sensitivity and specificity scores. When applied to samples from the REX-EX-01 study, the signatures identified subjects who had used THS 2.1 for five days as non-current smokers based on whole blood transcriptome data. The signatures described herein therefore provide a sensitive and specific tool for assessing exposure response using minimally invasive sampling.

FIG. 2 is a flow chart of a method 200 for assessing a sample obtained from a subject, according to an illustrative embodiment of the present disclosure. Method 200 includes the following steps: receiving a dataset associated with the sample, the dataset comprising quantitative expression data for LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17 (step 202); and generating a score based on the received dataset, wherein the score indicates a predicted smoking status of the subject (step 204). In some embodiments, the dataset received at step 202 further comprises quantitative expression data for IGJ, RRM2, SERPING1, FUCA1, and ID3. In some embodiments, the dataset received at step 202 further comprises quantitative expression data for one or more of CLDND1, MUC1, GOPC, and LEF1.

The score generated at step 204 is the result of applying a classification scheme to the dataset, wherein the classification scheme is determined based on the quantitative expression data in the dataset. Specifically, in the examples described herein, a classifier trained using an LDA model can be applied to the dataset received at step 202 to determine a predicted classification for the individual.
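As a rough illustration of steps 202 and 204, the following Python sketch fits a textbook two-class LDA (class means, pooled covariance, linear discriminant function) and scores a sample. It is not the classifier of this disclosure: the training values are hypothetical, only two features are used for brevity (the core signature has six), and all function names are assumptions. The sign is chosen to match the convention in the text, with positive scores indicating non-current smoker status.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(groups, means):
    """Within-class covariance pooled over all classes."""
    p = len(means[0])
    cov = [[0.0] * p for _ in range(p)]
    for rows, mu in zip(groups, means):
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for i in range(p):
                for j in range(p):
                    cov[i][j] += d[i] * d[j]
    dof = sum(len(g) for g in groups) - len(groups)
    return [[v / dof for v in row] for row in cov]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_lda(current, non_current):
    """Return (weights, bias) of the linear discriminant function."""
    mu0, mu1 = mean_vector(current), mean_vector(non_current)
    cov = pooled_covariance([current, non_current], [mu0, mu1])
    w = solve(cov, [b - a for a, b in zip(mu0, mu1)])
    midpoint = [(a + b) / 2 for a, b in zip(mu0, mu1)]
    log_prior_ratio = math.log(len(non_current) / len(current))
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint)) + log_prior_ratio
    return w, bias


def score(sample, w, bias):
    """Positive -> non-current smoker, negative -> current smoker."""
    return sum(wi * xi for wi, xi in zip(w, sample)) + bias


# Hypothetical log2 expression values for two genes (not study data).
current = [[9.0, 6.1], [8.6, 6.4], [9.3, 5.9], [8.8, 6.3]]
non_current = [[7.1, 6.9], [7.4, 7.2], [6.8, 7.0], [7.3, 6.6]]
w, bias = fit_lda(current, non_current)
print(score([9.1, 6.0], w, bias) < 0)  # resembles the current-smoker group
```

In practice an established library implementation of LDA would normally be preferred over hand-written linear algebra; the explicit version is shown only to make the relationship between the trained model and the score visible.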

The gene signatures described herein can be used in a computer-implemented method for assessing a sample obtained from a subject. Specifically, a dataset associated with the sample can be obtained, and the dataset can include quantitative expression data for LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17 for the core gene signature. A score can be generated based on the received dataset, wherein the score indicates the predicted smoking status of the subject. Specifically, the score can be based on a classifier constructed using the LDA model approach described herein. The dataset can further include quantitative expression data for the other markers included in the extended gene signature: IGJ, RRM2, SERPING1, FUCA1, and ID3. The dataset can further include quantitative expression data for one or more of CLDND1, MUC1, GOPC, and LEF1.

In some embodiments, the dataset includes any subset, of any size, of the marker set LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, ID3, CLDND1, MUC1, GOPC, and LEF1. One or more criteria can be applied to the markers included in the signature, such as requiring at least three (or any other suitable number) of LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17; at least two (or any other suitable number) of IGJ, RRM2, SERPING1, FUCA1, and ID3; and at least one (or any other suitable number) of CLDND1, MUC1, GOPC, and LEF1. In general, any signature that uses a combination of these markers can be used without departing from the scope of the present disclosure.
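The marker-count criteria in the preceding paragraph can be expressed as a simple set check. The following Python sketch uses the gene symbols and example minimums (three, two, one) given in the text; the function name, the group names, and the parameter names are hypothetical.

```python
# Marker groups as listed in the text.
CORE = {"LRRN3", "CDKN1C", "PALLD", "SASH1", "RGL1", "TNFRSF17"}
EXTENDED_ONLY = {"IGJ", "RRM2", "SERPING1", "FUCA1", "ID3"}
ADDITIONAL = {"CLDND1", "MUC1", "GOPC", "LEF1"}


def meets_criteria(markers, min_core=3, min_extended=2, min_additional=1):
    """Check whether a candidate signature satisfies the per-group minimums.

    The defaults mirror the example numbers in the text; any other
    suitable numbers may be passed instead.
    """
    chosen = set(markers)
    return (len(chosen & CORE) >= min_core
            and len(chosen & EXTENDED_ONLY) >= min_extended
            and len(chosen & ADDITIONAL) >= min_additional)


candidate = ["LRRN3", "PALLD", "SASH1", "IGJ", "ID3", "LEF1"]
print(meets_criteria(candidate))
print(meets_criteria(["LRRN3", "PALLD", "IGJ", "ID3", "LEF1"]))
```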

In some embodiments, the genes in the signatures described herein are used to assemble a kit for predicting an individual's smoker status. Specifically, the kit includes a set of reagents for detecting the expression levels of the genes in the gene signature in a test sample, and instructions for using the kit to predict the individual's smoker status. The kit can be used to assess the effects on an individual of smoking cessation or of an alternative to a smoking product (such as an HTP).

FIG. 3 is a block diagram of a computing device for performing any of the methods described herein (such as the methods described with respect to FIGS. 1 and 2) or for storing the core gene signature, the extended gene signature, or any other gene signature described herein. Specifically, a gene signature stored on a computer-readable medium includes expression data for LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17. In another example, the gene signature included on the computer-readable medium includes expression data for at least five markers selected from the group consisting of LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3.

In certain implementations, components and databases can be implemented across several computing devices 300. The computing device 300 includes at least one communication interface unit, an input/output controller 310, system memory, and one or more data storage devices. The system memory includes at least one random access memory (RAM 302) and at least one read-only memory (ROM 304). All of these elements are in communication with a central processing unit (CPU 306) to facilitate the operation of the computing device 300. The computing device 300 can be configured in many different ways. For example, the computing device 300 can be a conventional stand-alone computer or, alternatively, the functions of the computing device 300 can be distributed across multiple computer systems and architectures. The computing device 300 can be configured to perform some or all of the modeling, scoring, and aggregation operations. In FIG. 3, the computing device 300 is connected to other servers or systems via a network or local network.

The computing device 300 can be configured in a distributed architecture, in which databases and processors are housed in separate units or locations. Some such units perform primary processing functions and contain, at a minimum, a general-purpose controller or processor and system memory. In such an aspect, each of these units is attached via the communication interface unit 308 to a communication hub or port (not shown) that serves as the primary communication link with other servers, client or user computers, and other related devices. The communication hub or port may itself have minimal processing capability, serving primarily as a communication router. Various communication protocols can be part of the system, including but not limited to Ethernet, SAP, SAS™, ATP, BLUETOOTH™, GSM, and TCP/IP.

The CPU 306 includes a processor, such as one or more conventional microprocessors, and one or more supplementary coprocessors, such as math coprocessors, for offloading workload from the CPU 306. The CPU 306 is in communication with the communication interface unit 308 and the input/output controller 310, through which the CPU 306 communicates with other devices, such as other servers, user terminals, or devices. The communication interface unit 308 and the input/output controller 310 may include multiple communication channels for simultaneous communication with, for example, other processors, servers, or client terminals. Devices in communication with each other need not transmit to each other continuously. On the contrary, such devices need only transmit to each other as necessary, may actually refrain from exchanging data most of the time, and may require several steps to be performed to establish a communication link between them.

The CPU 306 is also in communication with the data storage device. The data storage device may include an appropriate combination of magnetic, optical, or semiconductor memory and may include, for example, RAM 302, ROM 304, a flash drive, an optical disc such as a compact disc, or a hard disk or drive. The CPU 306 and the data storage device may each be, for example, located entirely within a single computer or other computing device, or connected to each other by a communication medium, such as a USB port, a serial port cable, a coaxial cable, an Ethernet cable, a telephone line, a radio frequency transceiver, or another similar wireless or wired medium, or a combination of the foregoing. For example, the CPU 306 may be connected to the data storage device via the communication interface unit 308. The CPU 306 may be configured to perform one or more particular processing functions.

The data storage device may store, for example: (i) an operating system 312 for the computing device 300; (ii) one or more applications 314 (e.g., computer program code or a computer program product) adapted to direct the CPU 306 in accordance with the systems and methods described herein, and particularly in accordance with the processes described in detail with respect to the CPU 306; or (iii) a database 316 adapted to store information, which may be used to store information required by the program. In some aspects, the database includes a database storing experimental data and published literature models.

The operating system 312 and the applications 314 may be stored, for example, in a compressed, uncompiled, and encrypted format, and may include computer program code. The instructions of the program may be read into the main memory of the processor from a computer-readable medium other than the data storage device, such as from the ROM 304 or from the RAM 302. While execution of sequences of instructions in the program causes the CPU 306 to perform the process steps described herein, hard-wired circuitry may be used in place of, or in combination with, software instructions to implement the processes of the present disclosure. Thus, the systems and methods described are not limited to any specific combination of hardware and software.

Suitable computer program code may be provided for performing one or more of the functions described herein. The program may also include program elements such as the operating system 312, a database management system, and "device drivers" that allow the processor to interface with computer peripheral devices (e.g., a video display, a keyboard, a computer mouse, etc.) via the input/output controller 310.

As used herein, the term "computer-readable medium" refers to any non-transitory medium that provides or participates in providing instructions to the processor of the computing device 300 (or any other processor of a device described herein) for execution. Such a medium may take many forms, including but not limited to non-volatile media and volatile media. Non-volatile media include, for example, optical, magnetic, or magneto-optical disks, or integrated circuit memory such as flash memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electrically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other non-transitory medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to the CPU 306 (or any other processor of a device described herein) for execution. For example, the instructions may initially be borne on a magnetic disk of a remote computer (not shown). The remote computer can load the instructions into its dynamic memory and send the instructions over an Ethernet connection, a cable line, or even a telephone line using a modem. A communications device local to the computing device 300 (e.g., a server) can receive the data on the respective communications line and place the data on a system bus for the processor. The system bus carries the data to main memory, from which the processor retrieves and executes the instructions. The instructions received by main memory may optionally be stored in memory either before or after execution by the processor. In addition, instructions may be received via a communication port as electrical, electromagnetic, or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information.

Each reference cited herein is hereby incorporated by reference in its entirety.

While implementations of the present disclosure have been particularly shown and described with reference to specific examples, it will be understood by those skilled in the art that various changes in form and detail may be made to these implementations without departing from the scope of the present disclosure as defined by the appended claims. The scope of the present disclosure is thus indicated by the appended claims, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

1. A computer-implemented method for assessing a sample obtained from a subject, comprising:

receiving, via a receiving circuit, a dataset associated with the sample, the dataset including quantitative expression data for a gene signature comprising LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17; and

applying a classifier to the quantitative expression data of the gene signature and generating, by a processor, a score based on the quantitative expression data of the gene signature, wherein the classifier is trained based on a linear discriminant analysis (LDA) model, the LDA model is calculated based on the quantitative expression data of the gene signature, and the score indicates a predicted smoking status of the subject.

2. The computer-implemented method according to claim 1, wherein the quantitative expression data of the gene signature further includes quantitative expression data for IGJ, RRM2, SERPING1, FUCA1, and ID3.

3. The computer-implemented method according to claim 1 or 2, further comprising calculating a fold change value for each of LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17 in the gene signature.

4. The computer-implemented method according to claim 3, further comprising determining that each calculated fold change value exceeds a predetermined threshold.

5. The computer-implemented method according to claim 1, wherein the LDA model is calculated based on k-fold cross-validation of the quantitative expression data of the gene signature.

6. A kit for predicting an individual's smoking status, comprising:

a set of reagents for detecting, in a test sample, the expression levels of genes in a gene signature comprising LRRN3, CDKN1C, PALLD, SASH1, RGL1, and TNFRSF17; and

instructions for using the kit to predict the individual's smoking status, the instructions including applying a classifier to the detected expression levels of the genes in the gene signature, wherein the classifier is trained based on a linear discriminant analysis (LDA) model calculated based on quantitative expression data of the gene signature.

7. The kit according to claim 6, wherein the kit is used to assess the effect of an alternative to a smoking product on an individual.

8. The kit according to claim 7, wherein the alternative to the smoking product is a heated tobacco product.

9. The kit according to claim 8, wherein a decrease in LRRN3 expression is detected between day 0 and day 5 after the individual begins using the heated tobacco product.

10. The kit according to any one of claims 7 to 9, wherein the effect of the alternative on the individual is that the individual is classified as a nonsmoker.

11. The kit according to any one of claims 6 to 9, wherein the gene signature further comprises at least one of IGJ, RRM2, SERPING1, FUCA1, and ID3.

12. A kit for predicting an individual's smoking status, comprising:

a set of reagents for detecting, in a test sample, the expression levels of genes in a gene signature, the gene signature comprising up to fifteen biomarkers, wherein at least five biomarkers in the gene signature are selected from the group consisting of: LRRN3, CDKN1C, PALLD, SASH1, RGL1, TNFRSF17, IGJ, RRM2, SERPING1, FUCA1, and ID3; and

13. The kit according to claim 12, wherein the kit is used to assess the effect of a substitute for a smoking product on an individual, wherein the substitute for the smoking product is a heated tobacco product.

14. The kit according to claim 13, wherein a decrease in LRRN3 expression is detected between day 0 and day 5 after the individual begins using the heated tobacco product.

15. A computer-readable medium comprising computer-readable instructions that, when executed in a computerized system including at least one processor, cause the processor to perform one or more steps of the method according to any one of claims 1 to 5.