Phylogenomic_Tutorial_IntrogressionDetection_with_SNP

[TOC]

Github/mmatschiner的phylogenetic & phylogenomic学习教程记录【十一】:利用SNP数据鉴定渐渗introgression和杂交hybridization。对于物种树尺度上的进化历史来说,种间杂交、基因流的交流是会发生的,并且这种现象也会对二叉分枝的拓扑结构产生影响。本篇教程会利用ABBA-BABA related的Patterson’s D statistics来鉴定种间的渐渗信号。以及sliding window scan在各个染色体片段上的渐渗信号。

Preparation

Softwares:

1. 利用模拟数据推断物种树species-tree和基因流gene-flow

1.1 利用msprime模拟Phylogenomic数据

用常规数据来测试不同phylo方法的一大困难就是不好比较优劣,这时候可以利用simulated模拟测试数据。Coalescent-based msprime方法可较好的模拟出群体数据。可以利用pypopgen3中的simulate方法模拟测试数据。

教程所用的数据为20个species,每个species2个individuals,总共40个样本。物种分化于1ma作用,population size为50000;重组率和突变率为1e-8, 20Mb。教程提供了没有gene-flow和存在gene-flow的VCF文件。

1.2 对测试数据来构建系统树

我们利用前面教程中的PAUP软件方法来构建系统树。大致通过vcf2phylip转换vcf为nex格式文件,利用PAUP软件SVDQuartets,选择外类群,快速构建NJ树。
without_geneflow

with_geneflow

利用此方法对测试数据中的without gene flow,以及存在gene flow的数据进行快速建树,可以看见S14的拓扑位置在更外面,可能的解释是S14接受到了来自于S00的introgression导致的。

1.3 测试introgression引起的gene-flow

ABBA-BABA的基本原理,见之前的教程文章。几个关键点就是 在一个确定的系统发育框架下,存在ILS的现象中,找外类群为A,P3为固定下来的B位点的allel。
ABBA-BABA

利用的软件为Dsuite,可计算多种introgression相关的统计值:D, f4-ratio, fdm, f-branch。直接用VCF所计算。可参考原始文章Dsuite Paper

  1. 准备sample-species对应文件

    • 文件为sample_id species_id
    • 外类群的species_id需要指明为outgroup
      1
      bcftools query -l chr1_no_geneflow.vcf.gz | awk '{ if (NR <= 40) {sp=substr($1,1,3); print $1"\t"sp} else {print $1"\tOutgroup";}}' > species_sets.txt
  2. 在without geneflow的文件中探索

    • -c 不输出combine文件
    • -n prefix
    • -t 给定的species_tree树文件,非sample-based树文件

Dsuite运行文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## -c 不输出combine文件,DtriosCombine -n prefix
Dsuite Dtrios -c -n no_geneflow -t simulated_tree_no_geneflow.nwk chr1_no_geneflow.vcf.gz species_sets.txt


## 结果

head species_sets_no_geneflow_BBAA.txt
P1 P2 P3 Dstatistic Z-score p-value f4-ratio BBAA ABBA BABA
S01 S02 S00 0.00645161 0.228337 0.409692 7.6296e-05 40635 936 924
S01 S00 S03 0.0299321 1.08142 0.139754 0.000543936 28841 1724.75 1624.5
S01 S00 S04 0.0072971 0.308069 0.379015 0.00013279 28889.2 1691 1666.5
S01 S00 S05 0.00312175 0.133303 0.446977 5.6877e-05 28861.2 1687 1676.5
S01 S00 S06 0.00312175 0.135082 0.446273 5.69626e-05 28876.5 1687 1676.5
S01 S00 S07 0.00490269 0.210116 0.416789 8.92693e-05 28871.5 1691 1674.5
S00 S01 S08 0.0613992 1.7814 0.0374237 0.000460113 45178.5 806 712.75
S00 S01 S09 0.0704225 2.17799 0.0147034 0.000530632 45323 817 709.5
S00 S01 S10 0.07982 2.25263 0.0121413 0.000589494 45131.8 810 690.25

Dsuite为ABBA/BABA,即检测从P3到P2的渐渗信号出现为正。结果显示D值即使有0.8以上,但是Z-score和P-value都表明不排除是随机的结果,这表明其中不存在gene-flow。

结果会出现3个文件除以上的BBAA的文件。_tree.txt会按照tree的拓扑结构对每种可能trios进行排序。_Dmin会按照low bound下限来排序,尤其在各个species关系不明确的时候可以用此文件。

在R中做初步探究,查看各个统计值的差别

1
2
D_BBAA_noGF <- read.table("species_sets_no_geneflow_BBAA.txt",as.is=T,header=T)
plot(D_BBAA_noGF$Dstatistic, ylab="D",xlab="trio number")

1
2
D_BBAA_noGF[which(D_BBAA_noGF$Dstatistic > 0.7),]
P1 P2 P3 Dstatistic Z.score p.value f4.ratio BBAA ABBA BABA171 S18 S19 S00 1.000000 0.00000 NaN 7.42372e-06 179922 1.5 0324 S18 S19 S01 1.000000 0.00000 NaN 7.42743e-06 179814 1.5 0460 S18 S19 S02 1.000000 0.00000 NaN 7.39892e-06 179800 1.5 0580 S18 S19 S03 1.000000 0.00000 NaN 7.43931e-06 179778 1.5 0685 S18 S19 S04 1.000000 0.00000 NaN 7.44097e-06 179765 1.5 0690 S06 S05 S11 0.833333 2.91667 0.00176897 4.94462e-05 138120 11.0 1776 S18 S19 S05 1.000000 0.00000 NaN 7.43290e-06 179768 1.5 0854 S18 S19 S06 1.000000 0.00000 NaN 7.44536e-06 179777 1.5 0920 S18 S19 S07 1.000000 0.00000 NaN 7.42259e-06 179771 1.5 0

图中可以看到一些D值甚至大于0.7,我们对数据进行筛选,单独筛选得到D>0.7的值会发现是因为ABBA/BABA的统计count过少,以至于不具有代表性。另一方面,我们发现存在Pvalue是0.001.

1
plot(D_BBAA_noGF$p.value, ylab="p value",xlab="trio number",ylim=c(0,0.05))

而对于那些pvaue<0.05的值,实际上,也是存在问题的。需要根据总体样本进行多重检验矫正。通常是BH(Benjamini-Hochberg)矫正,以控制假阳性。

1
plot(p.adjust(D_BBAA_noGF$p.value,method="BH"), ylab="p value",xlab="trio number",ylim=c(0,0.05))

但即使如此我们也会发现存在少量几个十分显著的pvalue。但实际上这些还是假阳性。任何假设统计都会有假阳性和假阴性(All hypothesis testing has false positives and false negatives)。一个重要的经验就是:It may be helpful to focus less on statistical testing and aim for a more nuanced understanding of the dataset

我们同时以f4-ratio的值做判断,f4-ratio是估算基因组中受到geneflow影响的基因组比率。

1
plot(D_BBAA_noGF$f4.ratio, ylab="f4-ratio",xlab="trio number", ylim=c(0,1))

我们同时可以画heatmap图表示从P3到P2可能的introgression。

1
2
3
cut -f 2 species_sets.txt | uniq | head -n 20 > plot_order.txt
ruby plot_d.rb species_sets_no_geneflow_BBAA.txt plot_order.txt 0.7 species_sets_no_geneflow_BBAA_D.svg
ruby plot_f4ratio.rb species_sets_no_geneflow_BBAA.txt plot_order.txt 0.2 species_sets_no_geneflow_BBAA_f4ratio.svg


可以判断出不同的纵坐标P3对P2发生渐渗的比率

  1. 在with geneflow的文件中做探索存在的introgression
1
Dsuite Dtrios -c -n with_geneflow -t simulated_tree_with_geneflow.nwk with_geneflow.vcf.gz species_sets.txt

结果文件_BBAA, _tree.txt, _Dmin.txt 文件,differences between the _tree.txt file and the _BBAA.txt reflect that, under geneflow, sister species often do not share the most derived alleles.

随后对_BBAA.txt 文件进行探索:

1
2
3
plot(D_BBAA$Dstatistic, ylab="D",xlab="trio number") # The D statistic
plot(p.adjust(D_BBAA$p.value,method="BH"), ylab="corected p value",xlab="trio number",ylim=c(0,0.05))
plot(D_BBAA$f4.ratio, ylab="f4-ratio",xlab="trio number", ylim=c(0,0.2))



我们得到了许多的gene-flow的信号,暗示着各种可能的introgression信号。而实际上这批模拟数据只包含有5个实际的gene-flow,软件得到的各种可能
This is because the test statistics are correlated when trios share an (internal) branch in the overall population or species tree,物种数目一多,就不好判断是单对单的哪一个特定的introgression。

1
2
ruby plot_d.rb species_sets_with_geneflow_BBAA.txt plot_order.txt 0.7 species_sets_with_geneflow_BBAA_D.svg
ruby plot_f4ratio.rb species_sets_with_geneflow_BBAA.txt plot_order.txt 0.2 species_sets_with_geneflow_BBAA_f4ratio.svg

一些共同的区块就可以被用作判断这些个相近类群中可能存在的gene-flow,
the plot deals with some of the correlated signals and redundancy in the data by focusing on the overall support for geneflow between pairs of species or their ancestors, which could have happened at any time in the past since the species in P2 and P3 positions diverged from each other.

同时,由Malinsky所提出https://doi.org/10.1038/s41559-018-0717-x我们可以利用另一种方法,f-branch可以将gene flow指定到特定的internal branches。其基本原理 The logic of f-branch is illustated in the following figure. The panel (c) provides an example illustrating interdependences between different f4-ratio scores, which can be informative about the timing of introgression. In this example, different choices for the P1 population provide constraints on when the gene flow could have happened. (d) Based on relationships between the f4-ratio results from different four taxon tests, the f-branch, or fb statistic, distinguishes between admixture at different time periods, assigning signals to different (possibly internal) branches in the population/species tree

1
2
Dsuite Fbranch simulated_tree_with_geneflow.nwk species_sets_with_geneflow_tree.txt > species_sets_with_geneflow_Fbranch.txt
python3 /Users/milanmalinsky/Sanger_work/Dsuite/Dsuite/utils/dtools.py species_sets_with_geneflow_Fbranch.txt simulated_tree_with_geneflow.nwk

2. Finding specific introgressed loci.

此部分教程练习利用到Malinsky et al. (2018)的数据。文章中的一个点就是 两个深水的cichlids显示了适应性的选择信号,并且定位到了scaffold18上的两个green-sensitive opsin genes。为了探究这些共有的选择gene信号是由于 convergent evolution或者是adaptive introgression。我们利用Dsuite中的Dinvestigate计算f_dM统计值。

利用Dsuite Dinvestigate进行计算各个染色体introgression信号的扫描。文件接受VCF文件;SETS文件(sample \t spcies),TestTrios文件(P1 P2 P3)。

1
2
## -w 为50snp,每25step
Dsuite Dinvestigate -w 50,25 scaffold_18.vcf.gz MalawiSetsFile.txt MalawiTestTriosForInvestigate.txt

结果会生成D, f_d, f_dM, d_f值。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# read in the results with 50 SNP windows and a step of 25 SNPs
bigStep <- read.table("mbuna_deep_Diplotaxodon_localFstats__50_25.txt",as.is=T,header=T)

# plot D in windows along scaffold 18:
plot(bigStep$windowStart/1000000, bigStep$D,type="l",xlab="scaffold 18 coordinate (Mb)",ylab="D (ABBA-BABA)")
# plot f_dM in windows along scaffold 18:
plot(bigStep$windowStart/1000000, bigStep$f_dM,type="l",xlab="scaffold 18 coordinate (Mb)",ylab="f_dM")
# plot f_d in windows along scaffold 18:
plot(bigStep$windowStart/1000000, bigStep$f_d,type="l",xlab="scaffold 18 coordinate (Mb)",ylab="f_d")
# use f_dM and zoom-in on the region of interest
plot(bigStep$windowStart/1000000, bigStep$f_dM,type="l",xlim=c(4,4.5),xlab="scaffold 18 coordinate (Mb)",ylab="f_dM",ylim=c(-0.2,0.8))

# Same window: 50SNPs, and the smallest 1 SNP step
smallestStep <- read.table("mbuna_deep_Diplotaxodon_localFstats__50_1.txt",as.is=T,header=T)
# plot f_dM again:
plot(smallestStep$windowStart/1000000, smallestStep$f_dM,type="l",xlim=c(4,4.5),xlab="scaffold 18 coordinate (Mb)",ylab="f_dM",ylim=c(-0.2,0.8))

# Smaller 10SNP window, and the smallest 1 SNP step
smallerWindow <- read.table("mbuna_deep_Diplotaxodon_localFstats__10_1.txt",as.is=T,header=T)
# plot f_dM again:
plot(smallerWindow$windowStart/1000000, smallerWindow$f_dM,type="l",xlim=c(4.0,4.5),xlab="scaffold 18 coordinate (Mb)",ylab="f_dM",ylim=c(-0.2,0.9),pch=16)

# Finally, a tiny window with 2 SNPs and a step of 1 SNP
smallWindow <- read.table("mbuna_deep_Diplotaxodon_localFstats__2_1.txt",as.is=T,header=T)
plot(smallWindow$windowStart, smallWindow$f_dM,type="l",xlim=c(4000000,4500000),xlab="scaffold 18 coordinate (Mb)",ylab="f_dM",ylim=c(-0.2,0.8))


目标的opsin gene在 4.34.4Mb区域。而我们也确实观察到在4.3Mb左右区域有较明显的introgression信号。

我们再放大到特定的4.0~4.5Mb区域可以看到较好的信号区域。

当然window和steps可以根据不同的数据集进行调整。总的来说,window更小,数据更noisy,但信号也更明显;window越大,信号线条相对平缓。

3. 在Tanganyikan cichlids数据中探寻geneflow

1
2
3
4
5

Dsuite Dtrios -c -t SNAPP_tree.txt NC_031969.vcf.gz NC_031969_sets.txt

Dsuite Fbranch SNAPP_tree.txt NC_031969_sets__tree.txt > Tanganyika_Fbranch.txt
python3 dtools.py Tanganyika_Fbranch.txt SNAPP_tree.txt

4. Ancestry painting

另外一种探索祖先基因组模式在 杂交物种基因组中的表现。
A very simple alternative way of investigating patterns of ancestry in potentially introgressed or hybrid species is to “paint” their chromosomes according to the genotypes carried at sites that are fixed between the presumed parental species.

Der Sarkassian et al. (2015; Fig. 4)
the pic is drawn with two rows of cells. However, as the analyses in both studies were done with unphased data, these two rows do not represent the two haplotypes per sample. Instead, the two cells per site were simply both colored in the same way for homozygous sites or differently for heterozygous sites without regard to haplotype structure
每个个体两行,代表位点是纯和homozygous或杂合位点heterozygous。

此方法利用作者所提供的两个脚本get_fixed_site_gts.rb,和plot_fixed_site_gts.rb

  1. 对于脚本get_fixed_site_gts.rb,接受6个参数
    1. uncompressed vcf文件
    2. 输出文件名称
    3. first putative parent species:以comma分割
    4. putative hybrid species,可能的杂交物种;以comma分割
    5. second putative parent species。
    6. parental 基因型的完整度,可以过滤掉过多缺失值。
1
## fist and second parents为altfas(AUE7,AXD5) 和telvit(JBD5,JBD6),杂交物种为neocan(LJC9,LJD1)。1为排除祖先中的缺失值,仅保留非缺失值。

结果文件

  1. 绘图plot_fixed_site_gts.rb绘制ancestry painting,接受4个参数
    1. 上一步得到的文件
    2. 输出文件名称svg格式
    3. 完整度,排除杂交类群中缺失值
    4. 一段范围内仅筛选部分the minimum chromosomal distance in bp between SNPs included in the plot


结果文件
上下两个绝对颜色solid colors表明这些6069sites是祖先fixed。LJC9和LJD1为种间的杂交种,表示存在的两个祖先种的杂交信号。而这个图中有意思的点是被渐渗物种中几乎都是与祖先相对应的杂合位点。只能解释为这两个体是F1代的杂交种。如果渐渗发生的更为古老,或者曾发生回交,则后代中的大多数位点都是纯和状态,而少部分位点会显示和祖先相同的位点状态。unlike in cases where the genomes have been sequenced of parent-offspring trios, we do not know who the actual parent individuals were。We can guess that the parent individuals were members of the species Altolamprologus fasciatus and Telmatochromis vittatus, but whether the parental individuals were part of the same population as the sampled individuals or a more distantly related population within these species remains uncertain

References

https://github.com/mmatschiner/tutorials/blob/master/analysis_of_introgression_with_snp_data/README.md#SpecificLoci
Der Sarkassian et al. (2015; Fig. 4)
Stable species boundaries despite ten million years of hybridization in tropical eels

文章目录
  1. 1. Preparation
  2. 2. 1. 利用模拟数据推断物种树species-tree和基因流gene-flow
    1. 2.1. 1.1 利用msprime模拟Phylogenomic数据
    2. 2.2. 1.2 对测试数据来构建系统树
    3. 2.3. 1.3 测试introgression引起的gene-flow
  3. 3. 2. Finding specific introgressed loci.
  4. 4. 3. 在Tanganyikan cichlids数据中探寻geneflow
  5. 5. 4. Ancestry painting
  6. 6. References
|