lineage provides a framework for analyzing genotype (raw data) files from direct-to-consumer
(DTC) DNA testing companies, primarily for the purposes of genetic genealogy.
Create an Individual in the context of the lineage framework to interact with the
User662 dataset:
>>> user662=l.create_individual('User662',['resources/662.23andme.340.txt.gz','resources/662.ftdna-illumina.341.csv.gz'])Loading SNPs('662.23andme.340.txt.gz')Merging SNPs('662.ftdna-illumina.341.csv.gz')SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37Downloading resources/NCBI36_GRCh37.tar.gz27 SNP positions were discrepant; keeping original positions151 SNP genotypes were discrepant; marking those as null
Here we created user662 with the name User662. In the process, we merged two raw data
files for this individual. Specifically:
662.23andme.340.txt.gz was loaded.
Then, 662.ftdna-illumina.341.csv.gz was merged. In the process, it was found to have Build
36. So, it was automatically remapped to Build 37 (downloading the remapping data in the
process) to match the build of the SNPs already loaded. After this merge, 27 SNP positions and
151 SNP genotypes were found to be discrepant.
user662 is represented by an Individual object, which inherits from snps.SNPs.
Therefore, all of the properties and methods
available to a SNPs object are available here; for example:
All output files are saved to
the output directory (a parameter to Lineage).
This method also returns a pandas.DataFrame, and it can be inspected interactively at
the prompt, although the same output is available in the CSV file.
lineage uses the probabilistic recombination rates throughout the human genome from the
International HapMap Project
and the 1000 Genomes Project to compute the shared DNA
(in centiMorgans) between two individuals. Additionally, lineage denotes when the shared DNA
is shared on either one or both chromosomes in a pair. For example, when siblings share a segment
of DNA on both chromosomes, they inherited the same DNA from their mother and father for that
segment.
With that background, let’s find the shared DNA between the User662 and User663 datasets,
calculating the centiMorgans of shared DNA and plotting the results:
Notice that the centiMorgan and SNP thresholds for each DNA segment can be tuned. Additionally,
notice that two files were downloaded to facilitate the analysis and plotting - future analyses
will use the downloaded files instead of downloading the files again. Finally, notice that a list
of individuals is passed to find_shared_dna… This list can contain an arbitrary number of
individuals, and lineage will find shared DNA across all individuals in the list (i.e.,
where all individuals share segments of DNA on either one or both chromosomes).
Output is returned as a dictionary with the following keys (pandas.DataFrame and
pandas.Index items):
In this example, there are 27 segments of shared DNA:
>>> len(results['one_chrom_shared_dna'])27
Also, output files are
created; these files are detailed in the documentation and their generation can be disabled with a
save_output=False argument. In this example, the output files consist of a CSV file that
details the shared segments of DNA on one chromosome and a plot that illustrates the shared DNA:
The Central Dogma of Molecular Biology
states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into
mRNA, and mRNA is translated into a protein. It’s more complicated than this (it’s biology
after all), but generally, one mRNA produces one protein, and the mRNA / protein is considered a
gene.
Therefore, it would be interesting to understand not just what DNA is shared between individuals,
but what genes are shared between individuals with the same variations. In other words,
what genes are producing the same proteins? [*] Since lineage can determine the shared DNA
between individuals, it can use that information to determine what genes are also shared on
either one or both chromosomes.
For this example, let’s create two more Individuals for the User4583 and User4584
datasets:
The plot that illustrates the shared DNA is shown below. Note that in addition to outputting the
shared DNA segments on either one or both chromosomes, the shared genes on either one or both
chromosomes are also output.
Note
Shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic
maps since the X chromosome is not included in these genetic maps.
In this example, there are 15,976 shared genes on both chromosomes transcribed from 36 segments
of shared DNA:
The various output files produced by lineage are detailed below. Output files are saved in
the output directory, which is defined at the instantiation of the Lineage
object. Generation of output files can usually be enabled or disabled via a save_output
argument to the associated method.
Discordant SNPs between two or three individuals can be identified with
find_discordant_snps(). One CSV file is optionally output when
save_output=True.
Shared DNA between two or more individuals can be identified with
find_shared_dna(). One PNG file and up to two CSV files are output when
save_output=True.
cM_threshold corresponds to the same named parameter of
find_shared_dna(); “.” is replaced by “p” with precision of 2, e.g., “0p75”
snp_threshold corresponds to the same named parameter of
find_shared_dna()
genetic_map corresponds to the same named parameter of
find_shared_dna().
Note
If more than two individuals are compared, all Individual
names will be included in the filenames and plot titles using the same conventions.
Note
Genetic maps do not have recombination rates for the Y chromosome since the Y
chromosome does not recombine. Therefore, shared DNA will not be shown on the Y
chromosome.
This plot illustrates shared DNA (i.e., no shared DNA, shared DNA on one chromosome, and shared
DNA on both chromosomes). The centromere for each chromosome is also detailed. Two examples of
this plot are shown below.
In the above plot, note that the two individuals only share DNA on one chromosome. In this plot,
the larger regions where “No shared DNA” is indicated are due to SNPs not being available in
those regions (i.e., SNPs were not tested in those regions).
In the above plot, the areas where “No shared DNA” is indicated are the regions where SNPs were
not tested or where DNA is not shared. The areas where “One chromosome shared” is indicated are
regions where the individuals share DNA on one chromosome. The areas where “Two chromosomes
shared” is indicated are regions where the individuals share DNA on both chromosomes in the pair
(i.e., the individuals inherited the same DNA from their father and mother for those regions).
Note that the regions where DNA is shared on both chromosomes is a subset of the regions where
one chromosome is shared.
Shared genes (with the same genetic variations) between two or more individuals can be
identified with find_shared_dna(), with the parameter shared_genes=True.
In addition to the outputs produced by Find Shared DNA, up to two additional CSV files are
output that detail the shared genes when save_output=True.
In the filenames below, name1 is the name of the first
Individual and name2 is the name of the second
Individual. (If more individuals are compared, all
Individual names will be included in the filenames using the same
convention.)
The instructions below provide the steps to install lineage on a
Raspberry Pi (tested with
“Raspberry Pi OS (32-bit) Lite”,
release date 2020-08-20). For more details about Python on the Raspberry Pi, see
here.
Note
Text after a prompt (e.g., $) is the command to type at the command line. The
instructions assume a fresh install of Raspberry Pi OS and that after logging in as
the pi user, the current working directory is /home/pi.
Install pip for Python 3:
pi@raspberrypi:~ $ sudo apt install python3-pip
Press “y” followed by “enter” to continue. This enables us to install packages from the
Python Package Index.
Install the venv module:
pi@raspberrypi:~ $ sudo apt install python3-venv
Press “y” followed by “enter” to continue. This enables us to create a
virtual environment to isolate the lineage
installation from other system Python packages.
Now when you invoke Python or pip, the virtual environment’s version will be used (as
indicated by the (.venv) before the prompt). This can be verified as follows:
(.venv) pi@raspberrypi:~/lineage $ which python
/home/pi/lineage/.venv/bin/python
(.venv) pi@raspberrypi:~/lineage $ python
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Use lineage; examples shown in the README should now work.
At completion of usage, the virtual environment can be deactivated:
lineage could always use more documentation, whether as part of the official lineage
docs, in docstrings, or even on the web in blog posts, articles, and such. See below for info on
how to generate documentation.
When you’re done making changes, run all the tests with:
$ pipenv run pytest --cov-report=html --cov=lineage tests
Note
Downloads during tests are disabled by default. To enable downloads, set
the environment variable DOWNLOADS_ENABLED=true.
Note
If you receive errors when running the tests, you may need to specify the temporary
directory with an environment variable, e.g., TMPDIR="/path/to/tmp/dir".
Note
After running the tests, a coverage report can be viewed by opening
htmlcov/index.html in a browser.
Check code formatting:
$ pipenv run black --check --diff .
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
Computes the genetic distance in centiMorgans (cMs) between SNPs using the specified genetic
map. Applies thresholds to determine the shared DNA. Plots shared DNA. Optionally determines
shared genes (i.e., genes transcribed from the shared DNA).
All output is saved to the output directory as CSV or PNG files.
Notes
The code is commented throughout to help describe the algorithm and its operation.
To summarize, the algorithm first computes the genetic distance in cMs between SNPs
common to all individuals using the specified genetic map.
Then, individuals are compared for whether they share one or two alleles for each SNP in
common; in this manner, where all individuals share one chromosome, for example, there
will be several SNPs in a row where at least one allele is shared between individuals for
each SNP. The cM_threshold is then applied to each of these “matching segments” to
determine whether the segment could be a potential shared DNA segment (i.e., whether each
segment has a cM value greater than the threshold).
The matching segments that passed the cM_threshold are then checked to see if they
are adjacent to another matching segment, and if so, the segments are stitched together,
and the single SNP separating the segments is flagged as potentially discrepant. (This
means that multiple smaller matching segments passing the cM_threshold could be
stitched, identifying the SNP between each segment as discrepant.)
Next, the snp_threshold is applied to each segment to ensure there are enough SNPs in
the segment and the segment is not only a few SNPs in a region with a high recombination
rate; for each segment that passes this test, we have a segment of shared DNA, and the
total cMs for this segment are computed.
Finally, discrepant SNPs are checked to ensure that only SNPs internal to a shared DNA
segment are reported as discrepant (i.e., don’t report a discrepant SNP if it was part of a
segment that didn’t pass the snp_threshold). Currently, no action other than reporting
is taken on discrepant SNPs.
Parameters:
individuals (iterable of Individuals)
cM_threshold (float) – minimum centiMorgans for each shared DNA segment
snp_threshold (int) – minimum SNPs for each shared DNA segment
shared_genes (bool) – determine shared genes
save_output (bool) – specifies whether to save output files in the output directory
genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – genetic map to use for computation of shared DNA; HapMap2 corresponds to the HapMap
Phase II genetic map from the
International HapMap Project
and all others correspond to the
population-specific
genetic maps generated from the
1000 Genomes Project phased OMNI data.
Note that shared DNA is not computed on the X chromosome with the 1000 Genomes
Project genetic maps since the X chromosome is not included in these genetic maps.
Returns:
dict with the following items:
one_chrom_shared_dna (pandas.DataFrame)
segments of shared DNA on one chromosome
two_chrom_shared_dna (pandas.DataFrame)
segments of shared DNA on two chromosomes
one_chrom_shared_genes (pandas.DataFrame)
shared genes on one chromosome
two_chrom_shared_genes (pandas.DataFrame)
shared genes on two chromosomes
one_chrom_discrepant_snps (pandas.Index)
discrepant SNPs discovered while finding shared DNA on one chromosome
two_chrom_discrepant_snps (pandas.Index)
discrepant SNPs discovered while finding shared DNA on two chromosomes
Object used to represent and interact with an individual.
The Individual object maintains information about an individual. The object provides
methods for loading an individual’s genetic data (SNPs) and normalizing it for use with the
lineage framework.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ.
The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004 Jan
1;32(Database issue):D493-6. PubMed PMID: 14681465; PubMed Central PMCID:
PMC308837. https://www.ncbi.nlm.nih.gov/pubmed/14681465
Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, Fischer CM,
Gibson D, Gonzalez JN, Guruvadoo L, Haeussler M, Heitner S, Hinrichs AS,
Karolchik D, Lee BT, Lee CM, Nejad P, Raney BJ, Rosenbloom KR, Speir ML,
Villarreal C, Vivian J, Zweig AS, Haussler D, Kuhn RM, Kent WJ. The UCSC Genome
Browser database: 2017 update. Nucleic Acids Res. 2017 Jan 4;45(D1):D626-D634.
doi: 10.1093/nar/gkw1134. Epub 2016 Nov 29. PubMed PMID: 27899642; PubMed Central
PMCID: PMC5210591. https://www.ncbi.nlm.nih.gov/pubmed/27899642
International Human Genome Sequencing Consortium. Initial sequencing and
analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
http://dx.doi.org/10.1038/35057062
genetic_map ({‘HapMap2’, ‘ACB’, ‘ASW’, ‘CDX’, ‘CEU’, ‘CHB’, ‘CHS’, ‘CLM’, ‘FIN’, ‘GBR’, ‘GIH’, ‘IBS’, ‘JPT’, ‘KHV’, ‘LWK’, ‘MKK’, ‘MXL’, ‘PEL’, ‘PUR’, ‘TSI’, ‘YRI’}) – HapMap2 corresponds to the HapMap Phase II genetic map from the
International HapMap Project
and all others correspond to the
population-specific
genetic maps generated from the
1000 Genomes Project phased OMNI data.
Returns:
dict of pandas.DataFrame genetic maps if loading was successful, else {}
Genetic maps were generated for each population separately using LDhat
(http://ldhat.sourceforge.net/). Haplotypes were split into 2000 SNP windows
with an overlap of 200 SNPs between each window. The recombination rate was
estimated for each window independently, using a block penalty of 5 for a
total of 22.5 million iterations with a sample being taken from the MCMC
chain every 15,000 iterations. The first 7.5 million iterations were
discarded as burn in. Once rates were estimated, windows were merged by
splicing the estimates at the mid-point of the overlapping regions.
LDhat estimates the population genetic recombination rate, rho = 4Ner. In
order to convert to per-generation rates (measured in cM/Mb), the LDhat
rates were compared to pedigree-based rates from Kong et al. (2010).
Specifically, rates were binned at the 5Mb scale, and a linear regression
performed between the two datasets. The gradient of this line gives an
estimate of 4Ne, allowing the population based rates to be converted to
per-generation rates.
Returns:
dict of pandas.DataFrame population-specific 1000 Genomes Project genetic maps if
loading was successful, else {}