Code Documentation

lineage

tools for genetic genealogy and the analysis of consumer DNA test results

class lineage.Lineage(output_dir='output', resources_dir='resources', parallelize=False, processes=2)

Bases: object

Object used to interact with the lineage framework.

__init__(output_dir='output', resources_dir='resources', parallelize=False, processes=2)

Initialize a Lineage object.

Parameters:
  • output_dir (str) – name / path of output directory

  • resources_dir (str) – name / path of resources directory

  • parallelize (bool) – use multiprocessing to speed up calculations

  • processes (int) – number of processes to launch when multiprocessing is enabled

create_individual(name, raw_data=(), **kwargs)

Initialize an individual in the context of the lineage framework.

Parameters:
  • name (str) – name of the individual

  • raw_data (str, bytes, SNPs (or list or tuple thereof)) – path(s) to file(s), bytes, or SNPs object(s) with raw genotype data

  • **kwargs – parameters to snps.SNPs and/or snps.SNPs.merge

Returns:

Individual initialized in the context of the lineage framework

Return type:

Individual

download_example_datasets()

Download example datasets from openSNP.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Returns:

paths – paths to example datasets

Return type:

list of str or empty str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

find_discordant_snps(individual1, individual2, individual3=None, save_output=False)

Find discordant SNPs between two or three individuals.

Parameters:
  • individual1 (Individual) – reference individual (child if individual2 and individual3 are parents)

  • individual2 (Individual) – comparison individual

  • individual3 (Individual) – other parent if individual1 is child and individual2 is a parent

  • save_output (bool) – specifies whether to save output to a CSV file in the output directory

Returns:

discordant SNPs and associated genetic data

Return type:

pandas.DataFrame

References

  1. David Pike, “Search for Discordant SNPs in Parent-Child Raw Data Files,” David Pike’s Utilities, http://www.math.mun.ca/~dapike/FF23utils/pair-discord.php

  2. David Pike, “Search for Discordant SNPs when given data for child and both parents,” David Pike’s Utilities, http://www.math.mun.ca/~dapike/FF23utils/trio-discord.php
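
The idea of a discordant SNP can be illustrated with a minimal, standard-library sketch. It is not lineage's implementation: genotypes are assumed to be two-character strings keyed by rsid, and missing calls and sex chromosomes are ignored. The helper names (`shares_allele`, `trio_consistent`, `find_discordant`) and the sample genotypes are hypothetical.

```python
def shares_allele(g1, g2):
    """True if two diploid genotypes (e.g. "AG") have at least one allele in common."""
    return bool(set(g1) & set(g2))


def trio_consistent(child, parent_a, parent_b):
    """True if the child genotype can be formed from one allele of each parent."""
    c1, c2 = child
    return (c1 in parent_a and c2 in parent_b) or (c2 in parent_a and c1 in parent_b)


def find_discordant(child, parent_a, parent_b=None):
    """Return sorted rsids whose genotypes are inconsistent with the relationship.

    Each argument is a dict mapping rsid -> genotype string (an assumed layout,
    not the DataFrame that lineage returns).
    """
    common = set(child) & set(parent_a)
    if parent_b is not None:
        common &= set(parent_b)

    discordant = []
    for rsid in sorted(common):
        if parent_b is None:
            # Parent-child pair: they must share at least one allele.
            ok = shares_allele(child[rsid], parent_a[rsid])
        else:
            # Trio: one child allele must come from each parent.
            ok = trio_consistent(child[rsid], parent_a[rsid], parent_b[rsid])
        if not ok:
            discordant.append(rsid)
    return discordant


# Illustrative genotypes only.
child = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
mother = {"rs1": "AA", "rs2": "CT", "rs3": "CC"}
father = {"rs1": "GG", "rs2": "TT", "rs3": "CT"}

pair_result = find_discordant(child, mother)
trio_result = find_discordant(child, mother, father)
```

In this toy data, rs2 is consistent with the mother alone but becomes discordant once the father's genotype is considered, which is why the two-parent check can flag SNPs that the pairwise check misses.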

find_shared_dna(individuals=(), cM_threshold=0.75, snp_threshold=1100, shared_genes=False, save_output=True, genetic_map='HapMap2')

Find the shared DNA between individuals.

Computes the genetic distance in centiMorgans (cMs) between SNPs using the specified genetic map. Applies thresholds to determine the shared DNA. Plots shared DNA. Optionally determines shared genes (i.e., genes transcribed from the shared DNA).

All output is saved to the output directory as CSV or PNG files.

Notes

The code is commented throughout to help describe the algorithm and its operation.

To summarize, the algorithm first computes the genetic distance in cMs between SNPs common to all individuals using the specified genetic map.

Next, for each SNP in common, the individuals are compared to see whether they share one or two alleles. Where all individuals share one chromosome, for example, this yields a run of consecutive SNPs in which at least one allele is shared at every SNP. The cM_threshold is then applied to each of these “matching segments” to determine whether the segment could be a shared DNA segment (i.e., whether the segment has a cM value greater than the threshold).

The matching segments that passed the cM_threshold are then checked to see if they are adjacent to another matching segment, and if so, the segments are stitched together, and the single SNP separating the segments is flagged as potentially discrepant. (This means that multiple smaller matching segments passing the cM_threshold could be stitched, identifying the SNP between each segment as discrepant.)

Next, the snp_threshold is applied to each segment to ensure there are enough SNPs in the segment and the segment is not only a few SNPs in a region with a high recombination rate; for each segment that passes this test, we have a segment of shared DNA, and the total cMs for this segment are computed.

Finally, discrepant SNPs are checked to ensure that only SNPs internal to a shared DNA segment are reported as discrepant (i.e., don’t report a discrepant SNP if it was part of a segment that didn’t pass the snp_threshold). Currently, no action other than reporting is taken on discrepant SNPs.
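
The steps above can be sketched in plain Python. This is a simplified illustration for a single chromosome, not the code used by lineage: it assumes the SNPs are already sorted by position, that `cm_pos` holds each SNP's genetic position in cM, and that `match` marks whether at least one allele is shared at each SNP. The function names are hypothetical, and treating snp_threshold as an inclusive minimum is an assumption.

```python
def matching_runs(match):
    """Return (start, end) index pairs (inclusive) for runs of True values."""
    runs, start = [], None
    for i, m in enumerate(match):
        if m and start is None:
            start = i
        elif not m and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(match) - 1))
    return runs


def shared_segments(cm_pos, match, cM_threshold, snp_threshold):
    """Return (segments, discrepant) for one chromosome.

    segments: list of (start_index, end_index, length_in_cM)
    discrepant: indices of single SNPs bridged when stitching segments
    """
    # Keep matching segments longer than cM_threshold.
    passed = [
        (s, e) for s, e in matching_runs(match) if cm_pos[e] - cm_pos[s] > cM_threshold
    ]

    # Stitch segments separated by exactly one SNP; remember that SNP.
    stitched = []  # entries are [start, end, bridged_snp_indices]
    for s, e in passed:
        if stitched and s - stitched[-1][1] == 2:
            stitched[-1][2].append(s - 1)
            stitched[-1][1] = e
        else:
            stitched.append([s, e, []])

    # Apply snp_threshold (assumed inclusive) and report only the
    # discrepant SNPs that lie inside a surviving segment.
    segments, discrepant = [], []
    for s, e, bridged in stitched:
        if e - s + 1 >= snp_threshold:
            segments.append((s, e, cm_pos[e] - cm_pos[s]))
            discrepant.extend(bridged)
    return segments, discrepant


# Illustrative values only: 12 SNPs with one isolated mismatch at index 4.
match = [True, True, True, True, False, True, True, True, False, False, True, True]
cm_pos = [0.0, 0.5, 1.0, 1.5, 1.6, 1.7, 2.2, 2.7, 2.8, 2.9, 3.0, 3.1]

segments, discrepant = shared_segments(cm_pos, match, cM_threshold=0.75, snp_threshold=6)
```

Here the first two runs each pass the cM threshold and are separated by a single SNP, so they are stitched into one segment and the SNP at index 4 is reported as discrepant. Raising `snp_threshold` above 8 would drop both the segment and its discrepant SNP.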

Parameters:
  • individuals (iterable of Individuals) – individuals to compare for shared DNA

  • cM_threshold (float) – minimum centiMorgans for each shared DNA segment

  • snp_threshold (int) – minimum SNPs for each shared DNA segment

  • shared_genes (bool) – determine shared genes

  • save_output (bool) – specifies whether to save output files in the output directory

  • genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – genetic map to use for computation of shared DNA; HapMap2 corresponds to the HapMap Phase II genetic map from the International HapMap Project and all others correspond to the population-specific genetic maps generated from the 1000 Genomes Project phased OMNI data. Note that shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic maps since the X chromosome is not included in these genetic maps.

Returns:

dict with the following items:

  • one_chrom_shared_dna (pandas.DataFrame) – segments of shared DNA on one chromosome

  • two_chrom_shared_dna (pandas.DataFrame) – segments of shared DNA on two chromosomes

  • one_chrom_shared_genes (pandas.DataFrame) – shared genes on one chromosome

  • two_chrom_shared_genes (pandas.DataFrame) – shared genes on two chromosomes

  • one_chrom_discrepant_snps (pandas.Index) – discrepant SNPs discovered while finding shared DNA on one chromosome

  • two_chrom_discrepant_snps (pandas.Index) – discrepant SNPs discovered while finding shared DNA on two chromosomes

Return type:

dict
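
As a rough usage sketch, the calls documented above might be combined as follows. It assumes the lineage package is installed and that the two file paths (hypothetical arguments here) point to raw genotype files. The helper names `summarize_shared_dna` and `run_shared_dna` are illustrative and not part of lineage.

```python
# Keys of the dict returned by find_shared_dna, as listed in this documentation.
SHARED_DNA_KEYS = (
    "one_chrom_shared_dna",
    "two_chrom_shared_dna",
    "one_chrom_shared_genes",
    "two_chrom_shared_genes",
    "one_chrom_discrepant_snps",
    "two_chrom_discrepant_snps",
)


def summarize_shared_dna(results):
    """Map each documented key to the number of rows / entries it holds.

    len() works for pandas.DataFrame and pandas.Index alike; a missing key is
    counted as empty (a defensive assumption, not documented behavior).
    """
    return {key: len(results.get(key, ())) for key in SHARED_DNA_KEYS}


def run_shared_dna(path1, path2):
    """Find shared DNA between two raw data files and summarize the result."""
    # Imported lazily so the helpers above can be used without lineage installed.
    from lineage import Lineage

    lin = Lineage(output_dir="output", resources_dir="resources")
    ind1 = lin.create_individual("Individual 1", path1)
    ind2 = lin.create_individual("Individual 2", path2)

    results = lin.find_shared_dna(
        individuals=[ind1, ind2],
        cM_threshold=0.75,
        snp_threshold=1100,
        shared_genes=False,
        save_output=False,
        genetic_map="HapMap2",
    )
    return summarize_shared_dna(results)
```

Keeping the summary separate from the call into lineage makes it possible to check the handling of the returned dict without downloading any resources.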

lineage.individual

Class for representing individuals within the lineage framework.

class lineage.individual.Individual(name, raw_data=(), **kwargs)

Bases: SNPs

Object used to represent and interact with an individual.

The Individual object maintains information about an individual. The object provides methods for loading an individual’s genetic data (SNPs) and normalizing it for use with the lineage framework.

Individual inherits from snps.SNPs. See here for details about the SNPs object: https://snps.readthedocs.io/en/latest/snps.html

__init__(name, raw_data=(), **kwargs)

Initialize an Individual object.

Parameters:
  • name (str) – name of the individual

  • raw_data (str, bytes, SNPs (or list or tuple thereof)) – path(s) to file(s), bytes, or SNPs object(s) with raw genotype data

  • **kwargs – parameters to snps.SNPs and/or snps.SNPs.merge

get_var_name()

property name

Get this Individual’s name.

Return type:

str

lineage.resources

Class for downloading and loading required external resources.

lineage uses tables and data from UCSC’s Genome Browser:

References

  1. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D493-6. PubMed PMID: 14681465; PubMed Central PMCID: PMC308837. https://www.ncbi.nlm.nih.gov/pubmed/14681465

  2. Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, Fischer CM, Gibson D, Gonzalez JN, Guruvadoo L, Haeussler M, Heitner S, Hinrichs AS, Karolchik D, Lee BT, Lee CM, Nejad P, Raney BJ, Rosenbloom KR, Speir ML, Villarreal C, Vivian J, Zweig AS, Haussler D, Kuhn RM, Kent WJ. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 2017 Jan 4;45(D1):D626-D634. doi: 10.1093/nar/gkw1134. Epub 2016 Nov 29. PubMed PMID: 27899642; PubMed Central PMCID: PMC5210591. https://www.ncbi.nlm.nih.gov/pubmed/27899642

  3. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062

  4. hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19

  5. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  6. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class lineage.resources.Resources(*args, **kwargs)

Bases: snps.resources.Resources

Object used to manage resources required by lineage.

__init__(resources_dir='resources')

Initialize a Resources object.

Parameters:

resources_dir (str) – name / path of resources directory

download_example_datasets()

Download example datasets from openSNP.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Returns:

paths to example datasets

Return type:

list of str or empty str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

get_all_resources()

Get / download all resources (except reference sequences) used throughout lineage.

Returns:

dict of resources

Return type:

dict

get_cytoBand_hg19()

Get UCSC cytoBand table for Build 37.

Returns:

cytoBand table if loading was successful, else empty DataFrame

Return type:

pandas.DataFrame

References

  1. Ryan Dale, GitHub Gist, https://gist.github.com/daler/c98fc410282d7570efc3#file-ideograms-py

get_genetic_map(genetic_map)

Get specified genetic map.

Parameters:

genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – HapMap2 corresponds to the HapMap Phase II genetic map from the International HapMap Project and all others correspond to the population-specific genetic maps generated from the 1000 Genomes Project phased OMNI data.

Returns:

dict of pandas.DataFrame genetic maps if loading was successful, else {}

Return type:

dict
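
To illustrate how a genetic map of this kind can turn base-pair positions into genetic distances, here is a minimal sketch using linear interpolation. It assumes a per-chromosome map given as two parallel, sorted lists (base-pair positions and cumulative cM). That layout, the clamping at the map ends, and the function name `interpolate_cm` are assumptions for illustration, not the structure or method used by lineage.

```python
from bisect import bisect_right


def interpolate_cm(position, map_positions, map_cm):
    """Linearly interpolate the cumulative cM value at a base-pair position.

    map_positions must be sorted ascending and aligned with map_cm.
    Positions outside the map are clamped to the nearest map endpoint.
    """
    if position <= map_positions[0]:
        return map_cm[0]
    if position >= map_positions[-1]:
        return map_cm[-1]

    # Index of the first map position strictly greater than `position`.
    i = bisect_right(map_positions, position)
    x0, x1 = map_positions[i - 1], map_positions[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


# Illustrative map for one chromosome (values are made up).
map_positions = [1_000_000, 2_000_000, 4_000_000]
map_cm = [0.0, 1.0, 2.0]

# Genetic distance between two SNPs is the difference of their cM positions.
snp_a = interpolate_cm(1_500_000, map_positions, map_cm)
snp_b = interpolate_cm(3_000_000, map_positions, map_cm)
distance_cm = snp_b - snp_a
```

The two SNPs are 1.5 Mb apart but only 1.0 cM apart in this toy map, because the second interval has half the recombination rate of the first.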

get_genetic_map_1000G_GRCh37(pop)

Get population-specific 1000 Genomes Project genetic map. [1] [2]

Notes

From README_omni_recombination_20130507 [1]:

Genetic maps generated from the 1000G phased OMNI data.

[Build 37] OMNI haplotypes were obtained from the Phase 1 dataset (/vol1/ftp/phase1/analysis_results/supporting/omni_haplotypes/).

Genetic maps were generated for each population separately using LDhat (http://ldhat.sourceforge.net/). Haplotypes were split into 2000 SNP windows with an overlap of 200 SNPs between each window. The recombination rate was estimated for each window independently, using a block penalty of 5 for a total of 22.5 million iterations with a sample being taken from the MCMC chain every 15,000 iterations. The first 7.5 million iterations were discarded as burn in. Once rates were estimated, windows were merged by splicing the estimates at the mid-point of the overlapping regions.

LDhat estimates the population genetic recombination rate, rho = 4Ner. In order to convert to per-generation rates (measured in cM/Mb), the LDhat rates were compared to pedigree-based rates from Kong et al. (2010). Specifically, rates were binned at the 5Mb scale, and a linear regression performed between the two datasets. The gradient of this line gives an estimate of 4Ne, allowing the population based rates to be converted to per-generation rates.
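
The conversion described in the README excerpt can be sketched numerically. Since rho = 4Ner, regressing binned rho values on pedigree-based rates r (in Morgans) gives a slope that estimates 4Ne, and dividing rho by that slope recovers a per-generation rate. The sketch below fits a line through the origin, which is a simplifying assumption; the README does not state how the regression was set up. The function names and the numbers are synthetic.

```python
def estimate_four_ne(r_morgans, rho):
    """Least-squares slope through the origin for the model rho = 4Ne * r."""
    numerator = sum(r * p for r, p in zip(r_morgans, rho))
    denominator = sum(r * r for r in r_morgans)
    return numerator / denominator


def rho_to_cm_per_mb(rho_per_mb, four_ne):
    """Convert a population rate (rho per Mb) to a per-generation rate in cM/Mb."""
    # rho / 4Ne gives Morgans per Mb; multiply by 100 for centiMorgans.
    return 100.0 * rho_per_mb / four_ne


# Synthetic binned rates, constructed so that 4Ne = 40,000.
pedigree_cm_per_mb = [0.5, 1.0, 2.0]
rho_per_mb = [200.0, 400.0, 800.0]

four_ne = estimate_four_ne([c / 100.0 for c in pedigree_cm_per_mb], rho_per_mb)
converted = [rho_to_cm_per_mb(p, four_ne) for p in rho_per_mb]
```

With these inputs the converted rates reproduce the pedigree rates, which is the consistency one would expect when the population-based and pedigree-based estimates agree at the binned scale.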

Returns:

dict of pandas.DataFrame population-specific 1000 Genomes Project genetic maps if loading was successful, else {}

Return type:

dict

References

get_genetic_map_HapMapII_GRCh37()

Get International HapMap Consortium HapMap Phase II genetic map for Build 37. [4] [5]

Returns:

dict of pandas.DataFrame HapMapII genetic maps if loading was successful, else {}

Return type:

dict

References

get_kgXref_hg19()

Get UCSC kgXref table for Build 37.

Returns:

kgXref table if loading was successful, else empty DataFrame

Return type:

pandas.DataFrame

get_knownGene_hg19()

Get UCSC knownGene table for Build 37.

Returns:

knownGene table if loading was successful, else empty DataFrame

Return type:

pandas.DataFrame

lineage.visualization

Chromosome plotting functions.

Notes

Adapted from Ryan Dale’s GitHub Gist for plotting chromosome features. [6]

References

lineage.visualization.plot_chromosomes(one_chrom_match, two_chrom_match, cytobands, path, title, build)

Plots chromosomes with designated markers.

Parameters:
  • one_chrom_match (pandas.DataFrame) – segments to highlight on the chromosomes representing one shared chromosome

  • two_chrom_match (pandas.DataFrame) – segments to highlight on the chromosomes representing two shared chromosomes

  • cytobands (pandas.DataFrame) – cytobands table loaded with Resources

  • path (str) – path to destination .png file

  • title (str) – title for plot

  • build ({37}) – human genome build