CN112164424A - Population evolution analysis method based on non-reference genome - Google Patents

Info

Publication number
CN112164424A
Authority
CN
China
Prior art keywords
population
data
sample
information
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010768331.5A
Other languages
Chinese (zh)
Other versions
CN112164424B (en)
Inventor
刘书云
张海焕
姜丽荣
孙子奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Personal Gene Technology Co ltd
Original Assignee
Nanjing Personal Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Personal Gene Technology Co ltd
Priority to CN202010768331.5A
Publication of CN112164424A
Application granted
Publication of CN112164424B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a reference-genome-free population evolution analysis method based on 2d-RAD sequencing. The method splits the sample data, filters and clusters it to obtain population SNPs, analyzes population genetic parameters based on the sample grouping and the population SNP information, constructs a phylogenetic tree, and determines the optimal K value. A self-written R script then searches for SNPs that are common to or specific to two groups, according to the population SNP information and the designated group information, completing the population evolution analysis without a reference genome. The data analysis as a whole is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and yields more attractive charts.

Description

Population evolution analysis method based on non-reference genome
Technical Field
The invention relates to the technical field of gene sequencing analysis, in particular to a population evolution analysis method based on a non-reference genome.
Background
Population evolution analysis makes it possible to study differences in population structure and gene flow among subpopulations of the same species, as well as the population structure characteristics of different species. However, many species do not yet have a published reference genome, so population evolution analysis without a reference genome is required.
Several library construction methods exist for reference-free analysis (RAD, GBS, 2d-RAD, SLAF, etc.), and they differ in the first step of the analysis, data splitting. The existing 2d-RAD-based reference-free analysis method has a complex and inefficient data filtering process, especially when there are many projects and each project contains many samples. In practice, one project may be sequenced in several runs, producing several batches of data. The existing method cannot merge and filter these batches through an automated pipeline, so data merging and filtering consume a great deal of manual time.
With the continuing development of high-throughput sequencing, the content of existing analysis pipelines has come to look thin, whereas new reference-free analyses are more diverse and customizable. Earlier reference-free pipelines required manual operation at many points. The new reference-free analysis method is more automated, which improves server utilization, reduces the workload of the analyst, and makes the analysis content easier to control.
Disclosure of Invention
To overcome the above drawbacks of the prior art, the present invention aims to provide an automated method for population evolution analysis without a reference genome.
To achieve this purpose, the following technical scheme is adopted:
A population evolution analysis method without a reference genome after 2d-RAD sequencing, comprising the following steps:
the first step: splitting the data with a splitting script according to the barcode and the enzyme 1 and enzyme 2 cutting-site information of the sequenced samples, merging the data from multiple sequencing runs of the same sample, and storing the merged data in a first folder in fastq.gz format;
the second step: performing quality control on the split and merged data from the first step with a filtering script, then filtering the data by base quality (Q ≥ 20) and sequence length (≥ 50 bp), and storing the filtered data in a second folder in fastq.gz format;
the third step: first clustering sequences within each single sample: the paired-end sequencing data of a sample are merged into one file before clustering, then clustered with the ustacks command of the Stacks software to obtain the representative catalog sequences of each sample, and the result file is stored in a third folder in tags.tsv.gz format;
the fourth step: grouping the samples and clustering the catalog sequences of the individual samples to obtain consensus sequences across all samples, the consensus sequences serving as a pseudo-reference genome for all samples;
the fifth step: reading the file that specifies the grouping of every sample, specifying the missing-rate parameter, detecting population SNP information with the cstacks command of the Stacks software, and storing the population SNP information as a VCF file;
the sixth step: analyzing population genetic parameters with the populations command of Stacks based on the population SNP information from the fifth step, and calculating the population differentiation index Fst, the nucleotide diversity π, the expected heterozygosity, the observed heterozygosity and the haplotype diversity;
the seventh step: converting the format of the population SNP VCF file from the fifth step with the software vcftools and plink, then performing dimensionality reduction on the SNPs with the software GCTA to obtain the three principal components with the largest influence on the population and the contribution value of each, and finally drawing a PCA plot with a self-written R script;
the eighth step: converting the format of the obtained population SNP information and concatenating the SNP information of each single sample with a self-written Python script, then constructing a phylogenetic tree with different models;
the ninth step:
converting the population SNP format into the format required by the software STRUCTURE with a self-written Perl script, then specifying the number of SNPs and the number of populations used in the analysis, and calculating the proportion of each sample that belongs to each specified ancestral population;
then determining the optimal K value (the number of ancestral populations), from which it can be seen whether the grouping of the samples is consistent with the grouping originally specified;
the tenth step:
searching for SNP information that is common to or unique to two groups, according to the population SNP information and the designated group information, with a self-written Perl script.
In a preferred embodiment of the present invention, the filtering script is filter_batch_v2.pl.
In a preferred embodiment of the present invention, the model for constructing the phylogenetic tree includes any one or more of Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML) or the Bayesian method (BI).
In a preferred embodiment of the present invention, the optimal K value is the K value at the inflection point where the ln likelihood enters its plateau.
The beneficial effects of the invention are:
the data analysis as a whole is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and yields more attractive charts.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a PCA plot of the present invention.
FIG. 3 is the phylogenetic tree of the present invention based on the NJ model.
FIG. 4 is the population genetic structure plot at the optimal K of 3.
Detailed Description
The principle of the invention is as follows:
The automated filtering pipeline for 2d-RAD reference-free reduced-representation sequencing can split and filter data in batches and automatically complete all analysis items that follow data filtering. This improves data processing efficiency and server utilization, saves labor time, reduces human error, and ultimately shortens the whole project analysis cycle, achieving efficient, automated reference-free analysis with rich analysis content.
With reference to FIG. 1, the population evolution analysis method without a reference genome after 2d-RAD sequencing of the present invention comprises the following steps:
(1) data splitting step
A self-written script splits the data automatically according to the barcode and the enzyme 1 and enzyme 2 cutting-site information of the sequenced samples. In the sample information file, each line represents one sample, and the columns are, in order, the sample name, the barcode bases, the cutting site of enzyme 1 and the cutting site of enzyme 2, separated by tabs. If a sample has been sequenced in several runs, the pipeline matches and merges them automatically, and the merged data are stored together in the 1_RawData folder in fastq format.
The splitting script works as follows:
a library contains multiple samples; a four-column file with the sample name, the barcode, and the cutting-site sequences of enzyme 1 and enzyme 2 is input file 1, and the raw paired-end fastq.gz files of the library are input files 2 and 3;
if the first 7 bp at the 5' end of R1 match the barcode, the next 5 bases match the cutting site of enzyme 1, and the first 4 bp at the 5' end of the corresponding R2 read match the cutting-site sequence of enzyme 2, the read pair is assigned to that sample; after iterating over the reads in this way, the final split data of each sample are output.
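The matching rule above can be illustrated with a minimal Python sketch. It is not the split script of the invention: the function names `load_sample_sheet` and `assign_read_pair`, the sample names and all sequences are hypothetical, and only exact matching is shown.

```python
import csv
import io


def load_sample_sheet(handle):
    """Parse tab-separated rows: sample name, barcode, enzyme 1 site, enzyme 2 site."""
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        name, barcode, site1, site2 = (field.strip() for field in row[:4])
        samples.append((name, barcode.upper(), site1.upper(), site2.upper()))
    return samples


def assign_read_pair(r1_seq, r2_seq, samples):
    """Return the sample whose barcode and both cutting sites match, else None.

    R1 must start with barcode + enzyme 1 site; R2 must start with the
    enzyme 2 site (exact matches only, an assumption of this sketch).
    """
    for name, barcode, site1, site2 in samples:
        if r1_seq.startswith(barcode + site1) and r2_seq.startswith(site2):
            return name
    return None


# Hypothetical sample sheet: 7 bp barcode, 5-base enzyme 1 site, 4 bp enzyme 2 site.
sheet = io.StringIO("S1\tACGTACG\tAATTC\tTAAC\nS2\tTTGCATG\tAATTC\tTAAC\n")
samples = load_sample_sheet(sheet)
print(assign_read_pair("ACGTACGAATTCGGGTTT", "TAACCCGGA", samples))
```

Read pairs that match no sample return `None` and would be left unassigned.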
(2) Data quality control and filtering step
The self-written automatic filtering script filter_batch_v2.pl runs FastQC quality control on the samples and at the same time filters the data by base quality (Q ≥ 20) and sequence length (≥ 50 bp). After the run, all high-quality data are stored in 2_HQData in fastq.gz format.
The filtering script filter_batch_v2.pl works as follows:
the script first reads the paired-end sequence files ${name}_R1.fastq.gz and ${name}_R2.fastq.gz of the raw sample data in 1_RawData as input, renames the files, and runs quality control on them with the software fastqc to obtain a FastQC result file with the base quality and related information of the raw data;
then the software AdapterRemoval takes the raw fastq.gz files as input and removes the sequencing adapters, saving the newly generated result files in fastq format in 2_HQData; these fastq files are then the input of the sequence quality filtering program, which filters by a sliding-window method with a window size of 5 bp and a step of 1 bp;
the window moves forward one base at a time and the average Q value of its 5 bases is calculated; if the average Q value of a window is less than or equal to 20, only the last base of that window and the bases before it are kept;
then, if either read of a pair is 50 bp or shorter, both reads of the pair are removed. The final results are output as ${name}_HQ_R1.fq and ${name}_HQ_R2.fq.
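The sliding-window rule can be sketched in Python as follows. This is an illustration, not filter_batch_v2.pl: the function names are hypothetical, quality values are assumed to be already decoded to integer Phred scores, and the cut position (just after the last base of the first failing window) follows a literal reading of the text above and is an assumption.

```python
def sliding_window_trim(seq, quals, window=5, step=1, max_fail_q=20):
    """Truncate a read at the first window whose mean Q is <= max_fail_q.

    Assumption: the read keeps the bases up to and including the last base
    of the failing window. Reads shorter than the window are left unchanged.
    """
    for start in range(0, len(seq) - window + 1, step):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q <= max_fail_q:
            end = start + window
            return seq[:end], quals[:end]
    return seq, quals


def keep_pair(r1_seq, r2_seq, min_len=50):
    """A pair is removed if either read is min_len bp or shorter."""
    return len(r1_seq) > min_len and len(r2_seq) > min_len


# Hypothetical read: 10 good bases (Q30) followed by 50 poor bases (Q10).
read = "A" * 60
quality = [30] * 10 + [10] * 50
trimmed_seq, trimmed_q = sliding_window_trim(read, quality)
print(len(trimmed_seq), keep_pair(trimmed_seq, "C" * 80))
```

A stricter variant would cut at the start of the failing window instead; only the slice boundary in `sliding_window_trim` changes.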
(3) Single sample internal sequence clustering step
Because reference-free analysis has no reference genome, sequences are first clustered within each single sample. The paired-end sequencing data of a sample are merged into one file before clustering, then clustered with the ustacks command of the Stacks software to obtain the representative catalog sequences of each sample, and the result files are stored in the 3_Stacks folder in tags.tsv.gz format.
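The merging of paired-end data into one file before clustering could, for example, be done as in the following Python sketch. The function name `merge_paired_fastq` and the file names are hypothetical; the sketch simply appends the R2 records after the R1 records and accepts plain or gzip-compressed input, which is an assumption about how the merge is performed.

```python
import gzip
import shutil


def merge_paired_fastq(r1_path, r2_path, merged_path):
    """Concatenate the R1 and R2 reads of one sample into a single gzipped FASTQ."""
    with gzip.open(merged_path, "wb") as out:
        for path in (r1_path, r2_path):
            # Accept both plain .fq files and gzip-compressed .gz files.
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Example call with hypothetical file names:
# merge_paired_fastq("2_HQData/S1_HQ_R1.fq", "2_HQData/S1_HQ_R2.fq", "3_Stacks/S1.fq.gz")
```

The merged file is then the single input that a per-sample clustering command would read.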
(4) Catalog sequence clustering step of all samples
The grouping of the samples is specified, and the catalog sequences of the individual samples are clustered to obtain consensus sequences across all samples, which serve as the reference genome sequences for all samples.
(5) Step of detecting group SNP
The file that specifies the grouping of every sample is read, the missing-rate parameter is specified, the population SNP information is detected with the cstacks command of the Stacks software, and the population SNP information is stored as a VCF file.
(6) Analysis of population genetic parameters (Fst, π, heterozygosity, haplotype diversity)
Based on the population SNP information, population genetic parameters are analyzed with the populations command of Stacks, and the population differentiation index Fst, the nucleotide diversity π, the expected heterozygosity, the observed heterozygosity and the haplotype diversity are calculated.
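To make two of these parameters concrete, the following Python sketch computes the observed and expected heterozygosity at a single SNP for one population. It uses the textbook definitions (fraction of heterozygous genotypes, and one minus the sum of squared allele frequencies); Stacks may apply additional corrections, so the sketch is an illustration and not a reimplementation. The function name and the genotypes are hypothetical.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one SNP in one population.

    genotypes: list of (allele, allele) tuples; None marks a missing genotype.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Observed: fraction of called individuals carrying two different alleles.
    observed = sum(1 for a, b in called if a != b) / len(called)
    # Expected: 1 - sum of squared allele frequencies.
    counts = Counter(allele for g in called for allele in g)
    total = sum(counts.values())
    expected = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return observed, expected


# Hypothetical genotypes of five individuals, one of them missing.
population = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G"), None]
print(heterozygosity(population))
```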
(7) Step of population PCA analysis
The VCF file of the population SNPs is converted with the software vcftools and plink, then dimensionality reduction is performed on the SNPs with the software GCTA to obtain the three principal components with the largest influence on the population and the contribution value of each, and finally a PCA plot is drawn with a self-written R script.
The self-written R script first reads the PC1 and PC2 vectors output by GCTA as input, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with the ggplot2 package in R.
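The source does not give the formula for the contribution rate. A common choice, assumed in the Python sketch below, is each eigenvalue divided by the sum of the eigenvalues supplied. The function name and the eigenvalues are hypothetical, and whether the denominator should cover all eigenvalues or only the leading ones is left to the analyst.

```python
def contribution_rates(eigenvalues, n_components=2):
    """Return the share of total variance for the first n_components.

    Assumption: contribution rate = eigenvalue / sum of supplied eigenvalues.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues[:n_components]]


# Hypothetical eigenvalues, largest first.
rates = contribution_rates([4.0, 3.0, 2.0, 1.0])
labels = ["PC{} ({:.1f}%)".format(i + 1, 100 * r) for i, r in enumerate(rates)]
print(labels)
```

Labels of this form are typically used as the axis titles of the PCA scatter plot.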
(8) Step of group phylogenetic tree analysis
A self-written script converts the format of the obtained population SNP information and concatenates the SNP information of each sample, and a phylogenetic tree is then constructed with different models.
Common models for constructing phylogenetic trees include Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML) and the Bayesian method (BI);
the MP model is suitable for long sequences whose sites show no back mutation or parallel mutation, with high sequence similarity, a large number of nucleotides or amino acids, and a stable substitution rate. The NJ model is suitable for short sequences with small evolutionary distances and few informative sites. When the evolutionary model is known, the ML method is the tree-building method that best fits the evolutionary facts. The BI model retains the basic principle of the maximum likelihood method and introduces the Markov chain Monte Carlo method; it is suitable for inferring a phylogenetic tree, evaluating its uncertainty, detecting selection, comparing phylogenetic trees, calculating divergence times with reference to fossil records, and testing the molecular clock.
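Distance-based methods such as NJ start from a matrix of pairwise distances between samples. As a hedged illustration, the Python sketch below computes a simple p-distance (the proportion of differing sites) between the concatenated SNP sequences of each pair of samples, skipping sites where either sample is missing. The function names, the missing-data symbol `N` and the sequences are assumptions; the tree-building step itself is not shown.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, missing="N"):
    """Proportion of compared sites at which two sequences differ."""
    compared = 0
    diffs = 0
    for a, b in zip(seq_a, seq_b):
        if a == missing or b == missing:
            continue  # ignore sites with missing data in either sample
        compared += 1
        if a != b:
            diffs += 1
    return diffs / compared if compared else float("nan")


def distance_matrix(sequences):
    """Return {(name_a, name_b): distance} for every ordered pair of samples."""
    names = sorted(sequences)
    matrix = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(sequences[a], sequences[b])
        matrix[(a, b)] = matrix[(b, a)] = d
    return matrix


# Hypothetical concatenated SNP sequences of three samples.
snps = {"S1": "ACGTN", "S2": "ACGAA", "S3": "TCGAA"}
dist = distance_matrix(snps)
print(dist[("S1", "S2")], dist[("S2", "S3")])
```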
(9) Step of population genetic Structure analysis
A self-written script converts the population SNP format into the format required by the software STRUCTURE; the number of SNPs and the number of populations used in the analysis are then specified, and the proportion of each sample that belongs to each specified ancestral population is calculated. The optimal K value (the number of ancestral populations) is then determined, and the result shows whether the grouping of the samples is consistent with the grouping originally specified.
For each K value, the simulation based on the Bayesian model produces a corresponding maximum likelihood value, which is output after taking the natural logarithm (ln likelihood). The larger the ln likelihood, the closer the K value is to the real situation, but in general the ln likelihood enters a plateau as K increases. The optimal K value is the K value at the inflection point where the plateau begins.
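The source does not state how the inflection point is located. One simple heuristic, assumed in the Python sketch below, is to take the smallest K after which the gain in ln likelihood falls below a small fraction of the total range. The function name `pick_k`, the `tolerance` parameter and all ln likelihood values are hypothetical.

```python
def pick_k(ln_likelihoods, tolerance=0.05):
    """Return the K at which the ln likelihood curve starts to plateau.

    ln_likelihoods: dict mapping K to its ln likelihood.
    Heuristic (an assumption): the plateau starts at the first K whose gain
    to the next K is below tolerance * (max - min) of the curve.
    """
    ks = sorted(ln_likelihoods)
    values = [ln_likelihoods[k] for k in ks]
    span = max(values) - min(values)
    if span == 0:
        return ks[0]
    for i in range(len(ks) - 1):
        gain = values[i + 1] - values[i]
        if gain < tolerance * span:
            return ks[i]
    return ks[-1]


# Hypothetical ln likelihood values for K = 1..5.
curve = {1: -5000.0, 2: -4200.0, 3: -3800.0, 4: -3790.0, 5: -3785.0}
print(pick_k(curve))
```

In practice the curve is also inspected visually, since a fixed tolerance may not suit every data set.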
(10) Step of population-specific SNP analysis
A self-written script searches for SNP information that is common to or unique to two groups, according to the population SNP information and the designated group information.
The original SNPs are first filtered by genotype missingness and by the sequencing depth at each SNP site. The specificity of an SNP to a population is then defined by two thresholds, A and B: the frequency of the SNP in the target population must be higher than threshold A, and its frequency in the non-target population must be lower than threshold B. The threshold is generally set to 0.8.
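The two-threshold rule can be written down directly, as in the Python sketch below. The function name, the population labels and the frequencies are hypothetical. The source gives 0.8 as the usual threshold without saying whether it applies to A, B or both, so A defaults to 0.8 here and the value of B used in the example (0.2) is purely an assumption.

```python
def specific_snps(freqs, target, other, a=0.8, b=0.8):
    """Return SNP ids whose frequency is > a in target and < b in other.

    freqs: dict mapping SNP id to {population name: allele frequency}.
    """
    return [
        snp
        for snp, by_pop in freqs.items()
        if by_pop[target] > a and by_pop[other] < b
    ]


# Hypothetical allele frequencies in two groups.
frequencies = {
    "snp1": {"P1": 0.90, "P2": 0.10},
    "snp2": {"P1": 0.50, "P2": 0.50},
    "snp3": {"P1": 0.05, "P2": 0.95},
}
print(specific_snps(frequencies, "P1", "P2", a=0.8, b=0.2))
print(specific_snps(frequencies, "P2", "P1", a=0.8, b=0.2))
```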
Based on the above steps, the invention has the following advantages:
(1) The data analysis as a whole is more automated, which saves labor cost, improves analysis efficiency and avoids possible human error.
(2) The analysis content is richer, and the result charts are more attractive (see FIGS. 2-4).

Claims (4)

1. A population evolution analysis method without a reference genome after 2d-RAD sequencing, characterized by comprising the following steps:
the first step: splitting the data with a splitting script according to the barcode and the enzyme 1 and enzyme 2 cutting-site information of the sequenced samples, merging the data from multiple sequencing runs of the same sample, and storing the merged data in a first folder in fastq.gz format;
the second step: performing quality control on the split and merged data from the first step with a filtering script, then filtering the data by base quality (Q ≥ 20) and sequence length (≥ 50 bp), and storing the filtered data in a second folder in fastq.gz format;
the third step: first clustering sequences within each single sample: the paired-end sequencing data of a sample are merged into one file before clustering, then clustered with the ustacks command of the Stacks software to obtain the representative catalog sequences of each sample, and the result file is stored in a third folder in tags.tsv.gz format;
the fourth step: grouping the samples and clustering the catalog sequences of the individual samples to obtain consensus sequences across all samples, the consensus sequences serving as a pseudo-reference genome for all samples;
the fifth step: reading the file that specifies the grouping of every sample, specifying the missing-rate parameter, detecting population SNP information with the cstacks command of the Stacks software, and storing the population SNP information as a VCF file;
the sixth step: analyzing population genetic parameters with the populations command of Stacks based on the population SNP information from the fifth step, and calculating the population differentiation index Fst, the nucleotide diversity π, the expected heterozygosity, the observed heterozygosity and the haplotype diversity;
the seventh step: converting the format of the population SNP VCF file from the fifth step with the software vcftools and plink, then performing dimensionality reduction on the SNPs with the software GCTA to obtain the three principal components with the largest influence on the population and the contribution value of each, and finally drawing a PCA plot with a self-written R script;
the eighth step: converting the format of the obtained population SNP information and concatenating the SNP information of each single sample with a self-written Perl script, then constructing a phylogenetic tree with different models;
the ninth step:
converting the population SNP format into the format required by the software STRUCTURE with a self-written Python script, then specifying the number of SNPs and the number of populations used in the analysis, and calculating the proportion of each sample that belongs to each specified ancestral population;
then determining the optimal K value (the number of ancestral populations), from which it can be seen whether the grouping of the samples is consistent with the grouping originally specified;
the tenth step:
searching for SNP information that is common to or unique to two groups, according to the population SNP information and the designated group information, with a self-written Perl script.
2. The population evolution analysis method without a reference genome after 2d-RAD sequencing according to claim 1, wherein the filtering script is filter_batch_v2.pl.
3. The population evolution analysis method without a reference genome after 2d-RAD sequencing according to claim 1, wherein the model for constructing the phylogenetic tree comprises any one or more of Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML) or the Bayesian method (BI).
4. The method of claim 1, wherein the optimal K value is the K value at the inflection point where the ln likelihood enters its plateau.
CN202010768331.5A 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome Active CN112164424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010768331.5A CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010768331.5A CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Publications (2)

Publication Number Publication Date
CN112164424A true CN112164424A (en) 2021-01-01
CN112164424B CN112164424B (en) 2024-04-09

Family

ID=73859973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010768331.5A Active CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Country Status (1)

Country Link
CN (1) CN112164424B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7571151B1 (en) * 2005-12-15 2009-08-04 Gneiss Software, Inc. Data analysis tool for analyzing data stored in multiple text files
CN101968774A (en) * 2010-10-21 2011-02-09 中国人民解放军61938部队 Device and method for storing mobile data safely
GB201404479D0 (en) * 2013-03-15 2014-04-30 Palantir Technologies Inc Transformation of data items from data sources using a transformation script
CN104573409A (en) * 2015-01-04 2015-04-29 杭州和壹基因科技有限公司 Gene mapping multi-inspection method
CN105002567A (en) * 2015-06-30 2015-10-28 北京百迈客生物科技有限公司 Construction method of high-throughput simplified methylation sequencing library without reference genome
US20180260521A1 (en) * 2017-03-09 2018-09-13 Shine Biopharma Inc. Method and apparatus for multiple dot plot analysis
CN108388771A (en) * 2018-01-24 2018-08-10 安徽微分基因科技有限公司 A kind of bio-diversity automatic analysis method
WO2019191649A1 (en) * 2018-03-29 2019-10-03 Freenome Holdings, Inc. Methods and systems for analyzing microbiota
US20210057046A1 (en) * 2018-03-29 2021-02-25 Freenome Holdings, Inc. Methods and systems for analyzing microbiota
CN108537006A (en) * 2018-04-03 2018-09-14 郑州云海信息技术有限公司 A kind of gene sequence data processing method, apparatus and system
CN109182505A (en) * 2018-09-29 2019-01-11 南京农业大学 Mastadenitis of cow key SNPs site rs75762330 and 2b-RAD Genotyping and analysis method
CN111276185A (en) * 2020-02-18 2020-06-12 上海桑格信息技术有限公司 Microorganism identification and analysis system and device based on second-generation high-throughput sequencing
CN111235303A (en) * 2020-03-24 2020-06-05 中国环境科学研究院 Method for identifying cord-grass and spartina alterniflora

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JULIAN M. CATCHEN: "Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences", G3 GENES GENOMES GENETICS, vol. 1, pages 171 - 182, XP055071071, DOI: 10.1534/g3.111.000240 *
WEIXIN_34252686: "使用stacks分析RAD-seq", pages 1 - 40, Retrieved from the Internet <URL:HTTPS://blog.csdn.net/weixin_34252686/article/details/89628396> *
张珊珊;陈剑;杨文忠;: "应用简化基因组技术对富民枳遗传多样性检测", 东北林业大学学报, no. 04, 14 April 2020 (2020-04-14), pages 38 - 43 *
胡景杰 等: "RAD测序技术及其在水生生物研究中的应用", 水产科学, vol. 37, no. 1, pages 125 - 132 *
董树明, 徐文胜, 董逸生: "数据集成中的一种数据合并技术", 现代计算机, no. 11, 30 November 2003 (2003-11-30), pages 1 - 5 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113678767A (en) * 2021-08-10 2021-11-23 中国水产科学研究院黄海水产研究所 Breeding method for prawn disease resistance character
CN113678767B (en) * 2021-08-10 2022-08-23 中国水产科学研究院黄海水产研究所 Breeding method for prawn disease resistance character

Also Published As

Publication number Publication date
CN112164424B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Dumschott et al. Oxford Nanopore sequencing: new opportunities for plant genomics?
Amarasinghe et al. Opportunities and challenges in long-read sequencing data analysis
Wee et al. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing
US20200051663A1 (en) Systems and methods for analyzing nucleic acid sequences
US20170199959A1 (en) Genetic analysis systems and methods
Yue et al. Long-read sequencing data analysis for yeasts
CN105989249B (en) Methods, systems and devices for assembling genome sequences
WO2013043909A1 (en) Systems and methods for identifying sequence variation
AU2022298428A1 (en) Gene sequencing analysis method and apparatus, and storage medium and computer device
CN104484582B (en) The biological information project automatic analysis method and system realized by modularization selection
CN109559780A (en) A kind of RNA data processing method of high-flux sequence
US20150169823A1 (en) String graph assembly for polyploid genomes
CN105426700B (en) A kind of method that batch calculates genome ortholog evolutionary rate
Chen et al. Recent advances in sequence assembly: principles and applications
Liao et al. An efficient trimming algorithm based on multi-feature fusion scoring model for NGS data
CN112164424B (en) Group evolution analysis method based on no-reference genome
CN110570901B (en) Method and system for SSR typing based on sequencing data
CN111492436A (en) Rapid quality control of sequencing data using K-mers without alignment
US20210074382A1 (en) System and method for categorization of nucleic acid sequencing
CN119229969A (en) A method for accurately and efficiently evaluating gene editing efficiency based on NGS data, storage medium and electronic device
US20190050531A1 (en) Dna sequence processing method and device
CN117561573A (en) Automatic identification of the source of faults in nucleotide sequencing from base interpretation error patterns
CN110504007B (en) Working method and system for completing multi-scene strain identification in one-key mode
CN119479807B (en) Gene sequencing system and method for polygene repeated sequences
CN118430645B (en) A method for redefining whole-genome DNA data

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant