Breeding Column | 5 - Genotyping by Low-Depth Resequencing

Frontier sharing

2022-10-21


Today we continue the discussion of genotyping by resequencing with another cost-reducing strategy: low-coverage whole-genome sequencing (LcWGS).

 

Genotyping by high-depth resequencing is undoubtedly the most comprehensive approach, but it is currently too expensive for routine use in animal and plant breeding, especially for species with large, complex genomes. As mentioned in the previous issue, researchers often use special library construction methods to sequence a reduced representation of the genome (RAD-seq) and so lower the cost of genotyping. However, reduced-representation data generally cover only 1~10% of the genome, so much information is still lost. Pool sequencing (Pool-seq) also lowers the cost of population studies, but it cannot resolve individuals, which limits its usefulness in animal and plant breeding.

The LcWGS strategy combines the advantages of RAD-seq and Pool-seq while avoiding their disadvantages, as shown in Figure 1. At the same cost, it can survey the whole genome at the population level (balancing sequencing depth against genome breadth) while still retaining individual-level information. For this reason, combining LcWGS with imputation algorithms to obtain genome-wide genotypes has become a popular practice in recent years.

 

 

Introduction to LcWGS

So how low is low-depth whole-genome sequencing? In the editor's view, it generally means less than 5x, and even less than 1x, depending on how the number of samples is balanced against sequencing depth under a given budget. LcWGS first performs genome-wide low-depth resequencing and variant detection on every individual in the population, then uses an algorithm to infer and impute missing genotypes from the linkage disequilibrium (LD) between variants, and finally yields high-density genome-wide genetic markers for large numbers of samples.
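
To make the depth-versus-sample-size trade-off concrete, here is a minimal sketch. It assumes that per-site read depth follows a Poisson distribution whose mean is the average sequencing depth, and that the budget can be expressed as a total number of genome-equivalents (x). Both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math


def prob_depth_at_least(mean_depth: float, k: int) -> float:
    """P(a site is covered by at least k reads), assuming depth ~ Poisson(mean_depth)."""
    p_fewer = sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i) for i in range(k)
    )
    return 1.0 - p_fewer


def samples_for_budget(total_depth_budget: float, depth_per_sample: float) -> int:
    """Number of samples affordable when the budget is a total depth in genome-equivalents."""
    # The small epsilon guards against floating-point error in the division.
    return math.floor(total_depth_budget / depth_per_sample + 1e-9)


if __name__ == "__main__":
    budget = 3000.0  # hypothetical budget: 3000x in total
    for depth in (0.1, 0.5, 1.0, 5.0, 30.0):
        uncovered = 1.0 - prob_depth_at_least(depth, 1)
        n = samples_for_budget(budget, depth)
        print(f"{depth:>5}x  sites with no reads: {uncovered:6.1%}  samples: {n}")
```

Under this model a large share of sites in any one individual has no reads at all once depth falls below 1x, which is why the imputation step described above is indispensable.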

In recent years, large-sample LcWGS studies have shown in principle that genome-wide high-density SNP markers can be obtained at very low cost, which increases the accuracy of QTL mapping and helps uncover the genetic mechanisms of various diseases (Zan et al., 2019; Humburger et al., 2019). LcWGS is also used for association analysis (Cai et al., 2015) and population genetic research (Rustagi et al., 2017). One study found that the benefit of imputing low-density data up to whole-genome sequence level for breeding value prediction depends strongly on the frequency distribution of the causal mutations. Under a neutral model the advantage of imputed data is very small, but when all causal mutations have a low minor allele frequency, the accuracy of genetic evaluation using imputed data can improve by 30% (Druet et al., 2014).

The LcWGS preprocessing workflow is similar to that of WGS, with one important difference: genotype likelihoods are needed to account for genotype uncertainty in downstream analyses, for example when estimating the site frequency spectrum (SFS) (Figure 2).
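
The following sketch illustrates what a genotype likelihood is. It assumes a biallelic site, independent reads and a single shared per-base error rate. It is a deliberately simple binomial model for illustration, not the model implemented by any particular program, and the function names are hypothetical.

```python
import math


def genotype_likelihoods(n_ref: int, n_alt: int, error_rate: float = 0.01) -> list[float]:
    """Return P(reads | genotype) for genotypes carrying 0, 1 and 2 alt alleles."""
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2
        # Probability that one read shows the alt allele under this genotype.
        p_alt = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        likelihoods.append(
            math.comb(n_ref + n_alt, n_alt) * p_alt**n_alt * (1 - p_alt) ** n_ref
        )
    return likelihoods


def normalize(values: list[float]) -> list[float]:
    """Scale values to sum to 1 (genotype posteriors under a flat prior)."""
    total = sum(values)
    return [v / total for v in values]


if __name__ == "__main__":
    # A single read carrying the alt allele, as is typical at ~1x depth.
    print(normalize(genotype_likelihoods(n_ref=0, n_alt=1)))
    # Ten reads, five of each allele, as is typical at high depth.
    print(normalize(genotype_likelihoods(n_ref=5, n_alt=5)))
```

With one read, the heterozygous and homozygous-alt genotypes both remain plausible, so calling a hard genotype would discard real uncertainty. Carrying the three likelihoods forward lets downstream steps weigh that uncertainty instead.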

 

 

Genotype imputation

Genotype imputation (also rendered as genotype filling or interpolation) is the process of predicting and filling in missing genotypes from the haplotypes and genotypes in a reference panel. It rests on the assumption that even two apparently unrelated individuals share genomic segments inherited from a common ancestor. A panel containing a large number of markers can therefore be used to infer genotypes that were not observed in a sample, effectively increasing SNP density (Figure 3).

 

 

In genomic methodology, human research has generally been ahead of animal and plant research. Most current LcWGS software and algorithms were developed for the human genome, but they can serve as a reference for animal and plant genomes. The difficulty of LcWGS lies in accurately inferring and imputing individual genotypes. Most genotype imputation software currently uses a hidden Markov model (HMM) framework, estimating haplotypes with the help of the reference panel and inferring genotypes from them.
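
The HMM idea can be sketched with a toy haploid "copying" model in the style of Li and Stephens. The hidden state at each site is the reference haplotype being copied, and a forward-backward pass gives the probability of the alt allele at unobserved sites. This is an illustrative simplification, not the implementation of any specific imputation program. The function name, the constant `switch_rate` and the `error_rate` are assumptions made for the example.

```python
def impute_alt_probability(ref_haps, target, switch_rate=0.05, error_rate=0.01):
    """Posterior probability of the alt allele (1) at each site of a target haplotype.

    ref_haps: list of reference haplotypes, each a list of 0/1 alleles.
    target:   list of 0/1 alleles, with None where the allele was not observed.
    """
    n_haps, n_sites = len(ref_haps), len(target)

    def emission(k, m):
        # A missing observation carries no information about the hidden state.
        if target[m] is None:
            return 1.0
        return 1.0 - error_rate if ref_haps[k][m] == target[m] else error_rate

    def transition(weights):
        # Stay on the same haplotype, or switch to a uniformly chosen one.
        total = sum(weights)
        return [(1 - switch_rate) * w + switch_rate * total / n_haps for w in weights]

    # Forward pass, rescaled at each site to avoid numerical underflow.
    fwd = [[emission(k, 0) / n_haps for k in range(n_haps)]]
    for m in range(1, n_sites):
        moved = transition(fwd[-1])
        col = [moved[k] * emission(k, m) for k in range(n_haps)]
        scale = sum(col)
        fwd.append([c / scale for c in col])

    # Backward pass (the transition matrix is symmetric, so the same helper applies).
    bwd = [[1.0] * n_haps for _ in range(n_sites)]
    for m in range(n_sites - 2, -1, -1):
        weighted = [bwd[m + 1][k] * emission(k, m + 1) for k in range(n_haps)]
        col = transition(weighted)
        scale = sum(col)
        bwd[m] = [c / scale for c in col]

    # Combine both passes into a posterior over copied haplotypes, then over alleles.
    alt_prob = []
    for m in range(n_sites):
        post = [fwd[m][k] * bwd[m][k] for k in range(n_haps)]
        total = sum(post)
        alt_prob.append(sum(p * ref_haps[k][m] for k, p in enumerate(post)) / total)
    return alt_prob


if __name__ == "__main__":
    panel = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 1, 0, 1, 0]]
    observed = [1, 1, None, 1, 1]  # the middle site was not covered by any read
    print(impute_alt_probability(panel, observed))
```

Real tools work on diploid data, use genetic maps to set switch rates between sites, and scale to thousands of reference haplotypes, but the principle of borrowing information from flanking sites through shared haplotypes is the same.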

Genotype imputation falls into two types. The more familiar one works on variant files: a reference panel is used to impute a VCF so that it contains the same variants as the reference panel. The other genotypes LcWGS samples directly from the aligned BAM files together with the reference panel, and likewise ends with the same number of sites as the reference panel (Figure 4). A VCF contains only the variant sites detected in the study population (which does not mean that other populations carry no variants elsewhere), whereas BAM-based imputation draws on reads that cover the genome more widely and takes into account the phase information carried by read 1 and read 2 of each pair, which matters for imputation. The BAM-based LcWGS approach therefore imputes better.

 

 

Figure description:

High-depth resequencing data from high-generation samples are used to determine the SNP reference dataset. The low-depth resequencing data are filtered and aligned to the reference genome to produce intermediate BAM files, which are then imputed on the basis of the highly variable sites (HCS). In parallel, accuracy is evaluated against high-depth resequencing data from randomly chosen individuals, and the final result is an SNP dataset that can be used for genomic breeding.

 

As the above shows, a reference panel appears to be essential for genotype imputation. How can one be obtained for a specific species? Human reference panels are already very comprehensive, for example 1000 Genomes and the Haplotype Reference Consortium (HRC), and are not covered here. In recent years, animal and plant researchers have also built corresponding databases from which reference panels can be downloaded:

Animal-ImputeDB (http://gong_lab.hzau.edu.cn/Animal_ImputeDB/#!/) includes 2265 samples from 13 species.

Plant-ImputeDB (http://gong_lab.hzau.edu.cn/Plant_imputeDB/#!/) includes 34244 samples from 12 species.

What if the species under study has no reference panel? In the editor's view, the first option is to build one yourself, provided the population materials are representative enough. The second is to use software that does not require a reference panel, such as STITCH (Davies et al., 2016).

  

  

Common tools

Many tools are available for LcWGS imputation. A few representative examples follow. Readers interested in how the software is used or how the algorithms work are welcome to contact the editor.

ANGSD. Probably one of the most widely used programs (Korneliussen et al., 2014). The SFS-based workflow in Figure 2 illustrates ANGSD's approach. For details, see the review "A beginner's guide to low-coverage whole genome sequencing for population genetics". The authors of that article have also set up an accompanying course, with materials available on GitHub: https://github.com/nt246/lcwgs-guide-tutorial

Meta-imputation. Rather than being limited to a single reference panel, it builds a combined reference for the specific study population by allowing imputation results generated with different reference panels to be merged into one consistent imputed dataset (Yu et al., 2022). The software was developed recently and has so far been used only in humans.

STITCH. Also highly influential, it was published in Nature Genetics in 2016 (Davies et al., 2016). For example, a study of low-depth sequencing data (0.06-0.1x) from about 140,000 Chinese non-invasive prenatal testing (NIPT) samples, which used BaseVar, developed by BGI, together with STITCH imputation, was published in Cell in 2018 (Liu et al., 2018). Similarly, Hu Xiaoxiang of China Agricultural University, working with MGI, adopted the BaseVar-STITCH pipeline (Yang et al., 2021) in an LcWGS breeding workflow for Duroc boars, as shown in Figure 5:

 

 

Other common genotype imputation software, such as Beagle, IMPUTE2, SHAPEIT2+IMPUTE2 and MACH+Minimac3, is not covered here because it was not designed specifically for LcWGS.

 

LcWGS features

Compared with the genotyping strategies introduced earlier in this column, LcWGS has clear advantages. The technologies compare along several dimensions as follows:

                        LcWGS      WGS     Array    RAD-seq
Sequencing depth        low        high    --       high
Number of variants      more       more    less     less
New variant detection   yes        yes     no       no
Accuracy                moderate   high    high     high
Reference genome        yes        yes     yes      yes/no
Cost                    low        high    low      low

 

Although LcWGS has many advantages, it still has the following shortcomings:

  • The workflow is relatively complex and lacks user-friendly software interfaces and documentation;
  • Phasing and imputation are required, which is computationally demanding;
  • Current software has some defects that lead to inconsistent genotype calls;
  • It is poorly suited to analyses that depend on known genotypes, and it is vulnerable to batch effects;
  • Without a reference panel, phase cannot be determined accurately (so haplotype-based analyses are not possible);
  • It is not suitable for small sample sizes or complex genomes.

Overall, the editor believes that LcWGS is a new method worth exploring in animal and plant breeding. In fact, as early as 2010, the first rice GWAS, published by Han Bin's group (517 rice landraces resequenced at about 1x and then imputed), already used LcWGS, although imputation relied on the classic and simple K-nearest neighbor (KNN) algorithm (Huang et al., 2010). In commercial breeding practice, the Israeli company NRGene has made some attempts. However, how to embed LcWGS into a complete breeding plan and workflow through sound experimental design remains a major problem. Researchers need to set breeding goals, design the scheme around the species' genome and the number of breeding materials, choose an appropriate sequencing strategy, use good algorithms, control the budget, and find a suitable balance among all of these. For optimizing a specific experimental design, the following simulation workflow is a useful reference: https://github.com/therkildsen-lab/lcwgs-simulation
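
For readers unfamiliar with nearest-neighbor imputation, the general idea can be sketched as follows. This is a generic illustration and not the specific algorithm of Huang et al. (2010). The genotype coding (0/1/2 alt-allele counts, None for missing), the distance measure and the function name are all assumptions made for the example.

```python
from collections import Counter


def knn_impute(genotypes, k=3):
    """Fill missing genotypes by majority vote among the k most similar samples.

    genotypes: list of samples, each a list of 0/1/2 alt-allele counts or None.
    """
    n_samples = len(genotypes)
    n_sites = len(genotypes[0])

    def distance(a, b):
        # Mean absolute genotype difference over sites observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(abs(x - y) for x, y in shared) / len(shared)

    imputed = [row[:] for row in genotypes]
    for i in range(n_samples):
        for j in range(n_sites):
            if genotypes[i][j] is not None:
                continue
            # Only samples with an observed genotype at site j can act as donors.
            donors = sorted(
                (distance(genotypes[i], genotypes[d]), d)
                for d in range(n_samples)
                if d != i and genotypes[d][j] is not None
            )
            nearest = [genotypes[d][j] for _, d in donors[:k]]
            if nearest:
                imputed[i][j] = Counter(nearest).most_common(1)[0][0]
    return imputed


if __name__ == "__main__":
    data = [
        [0, 0, None, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [2, 2, 2, 2],
        [2, 2, 2, 2],
        [2, 2, None, 2],
    ]
    for row in knn_impute(data, k=2):
        print(row)
```

Unlike HMM-based methods, this approach ignores linkage along the chromosome beyond overall sample similarity, which is one reason later tools moved to haplotype models.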

 

That is all for this issue. See you next time.

 

