Computes the genetic distance in centiMorgans (cMs) between SNPs using the specified genetic
map. Applies thresholds to determine the shared DNA. Plots shared DNA. Optionally determines
shared genes (i.e., genes transcribed from the shared DNA).
All output is saved to the output directory as CSV or PNG files.
Notes
The code is commented throughout to help describe the algorithm and its operation.
To summarize, the algorithm first computes the genetic distance in cMs between SNPs
common to all individuals using the specified genetic map.
Then, individuals are compared for whether they share one or two alleles for each SNP in
common; in this manner, where all individuals share one chromosome, for example, there
will be several SNPs in a row where at least one allele is shared between individuals for
each SNP. The cM_threshold is then applied to each of these “matching segments” to
determine whether the segment could be a potential shared DNA segment (i.e., whether each
segment has a cM value greater than the threshold).
The matching segments that passed the cM_threshold are then checked to see if they
are adjacent to another matching segment, and if so, the segments are stitched together,
and the single SNP separating the segments is flagged as potentially discrepant. (This
means that multiple smaller matching segments passing the cM_threshold could be
stitched, identifying the SNP between each segment as discrepant.)
Next, the snp_threshold is applied to each segment to ensure there are enough SNPs in
the segment and the segment is not only a few SNPs in a region with a high recombination
rate; for each segment that passes this test, we have a segment of shared DNA, and the
total cMs for this segment are computed.
Finally, discrepant SNPs are checked to ensure that only SNPs internal to a shared DNA
segment are reported as discrepant (i.e., don’t report a discrepant SNP if it was part of a
segment that didn’t pass the snp_threshold). Currently, no action other than reporting
is taken on discrepant SNPs.
Parameters:
individuals (iterable of Individuals)
cM_threshold (float) – minimum centiMorgans for each shared DNA segment
snp_threshold (int) – minimum SNPs for each shared DNA segment
shared_genes (bool) – determine shared genes
save_output (bool) – specifies whether to save output files in the output directory
genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – genetic map to use for computation of shared DNA; HapMap2 corresponds to the HapMap
Phase II genetic map from the
International HapMap Project
and all others correspond to the
population-specific
genetic maps generated from the
1000 Genomes Project phased OMNI data.
Note that shared DNA is not computed on the X chromosome with the 1000 Genomes
Project genetic maps since the X chromosome is not included in these genetic maps.
Returns:
dict with the following items:
one_chrom_shared_dna (pandas.DataFrame)
segments of shared DNA on one chromosome
two_chrom_shared_dna (pandas.DataFrame)
segments of shared DNA on two chromosomes
one_chrom_shared_genes (pandas.DataFrame)
shared genes on one chromosome
two_chrom_shared_genes (pandas.DataFrame)
shared genes on two chromosomes
one_chrom_discrepant_snps (pandas.Index)
discrepant SNPs discovered while finding shared DNA on one chromosome
two_chrom_discrepant_snps (pandas.Index)
discrepant SNPs discovered while finding shared DNA on two chromosomes
Object used to represent and interact with an individual.
The Individual object maintains information about an individual. The object provides
methods for loading an individual’s genetic data (SNPs) and normalizing it for use with the
lineage framework.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ.
The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004 Jan
1;32(Database issue):D493-6. PubMed PMID: 14681465; PubMed Central PMCID:
PMC308837. https://www.ncbi.nlm.nih.gov/pubmed/14681465
Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, Fischer CM,
Gibson D, Gonzalez JN, Guruvadoo L, Haeussler M, Heitner S, Hinrichs AS,
Karolchik D, Lee BT, Lee CM, Nejad P, Raney BJ, Rosenbloom KR, Speir ML,
Villarreal C, Vivian J, Zweig AS, Haussler D, Kuhn RM, Kent WJ. The UCSC Genome
Browser database: 2017 update. Nucleic Acids Res. 2017 Jan 4;45(D1):D626-D634.
doi: 10.1093/nar/gkw1134. Epub 2016 Nov 29. PubMed PMID: 27899642; PubMed Central
PMCID: PMC5210591. https://www.ncbi.nlm.nih.gov/pubmed/27899642
International Human Genome Sequencing Consortium. Initial sequencing and
analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
http://dx.doi.org/10.1038/35057062
genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – HapMap2 corresponds to the HapMap Phase II genetic map from the
International HapMap Project
and all others correspond to the
population-specific
genetic maps generated from the
1000 Genomes Project phased OMNI data.
Returns:
dict of pandas.DataFrame genetic maps if loading was successful, else {}
Genetic maps were generated for each population separately using LDhat
(http://ldhat.sourceforge.net/). Haplotypes were split into 2000 SNP windows
with an overlap of 200 SNPs between each window. The recombination rate was
estimated for each window independently, using a block penalty of 5 for a
total of 22.5 million iterations with a sample being taken from the MCMC
chain every 15,000 iterations. The first 7.5 million iterations were
discarded as burn in. Once rates were estimated, windows were merged by
splicing the estimates at the mid-point of the overlapping regions.
LDhat estimates the population genetic recombination rate, rho = 4Ner. In
order to convert to per-generation rates (measured in cM/Mb), the LDhat
rates were compared to pedigree-based rates from Kong et al. (2010).
Specifically, rates were binned at the 5Mb scale, and a linear regression
performed between the two datasets. The gradient of this line gives an
estimate of 4Ne, allowing the population based rates to be converted to
per-generation rates.
Returns:
dict of pandas.DataFrame population-specific 1000 Genomes Project genetic maps if
loading was successful, else {}