TelomereHunter is a tool for estimating telomere content from human whole-genome sequencing data. It is designed to take BAM files from a tumor and a matching control sample as input. However, it is also possible to run TelomereHunter with one input file. TelomereHunter extracts and sorts telomeric reads from the input sample(s). For the estimation of telomere content, GC biases are taken into account. Finally, the results of TelomereHunter are visualized in several diagrams.
TelomereHunter is available for download at the following address: https://pypi.python.org/pypi/telomerehunter/
Further information can be found in the research article.
The individual steps of the tool are explained in more detail in the following sections.
Usage: telomerehunter -ibt TUMOR_BAM -ibc CONTROL_BAM -o OUTPUT_DIRECTORY -p PID [options]
All possible TelomereHunter options can be found using the -h/--help option.
For citing TelomereHunter in your research use:
TelomereHunter – in silico estimation of telomere content and composition from cancer genomes
Lars Feuerbach, Lina Sieverling, Katharina I. Deeg, Philip Ginsbach, Barbara Hutter, Ivo Buchhalter, Paul A. Northcott, Sadaf S. Mughal, Priya Chudasama, Hanno Glimm, Claudia Scholl, Peter Lichter, Stefan Fröhling, Stefan M. Pfister, David T. W. Jones, Karsten Rippe & Benedikt Brors
BMC Bioinformaticsvolume 20, Article number: 272 (2019)
Filtering of telomere reads
From each input sample, all reads containing a specified number of telomeric repeats are extracted. Besides the number of telomeric repeats, the user can also specify whether the repeats should be consecutive or non-consecutive (default: non-consecutive) and which hexameric repeat types to search for (default: TTAGGG, TCAGGG, TGAGGG and TTGGGG). Using default filtering criteria, TelomereHunter searches for all four telomeric repeat types and the number of repeats is calculated depending on the read length with the following formula: floor(read_length*0.06).
Secondary and supplementary alignments in the BAM file are not extracted. If the -d/--removeDuplicates option is specified, reads marked as duplicates are also not extracted. All extracted telomere reads are written into an indexed BAM file (*_filtered.bam). This BAM file is additionally sorted by read names (*_filtered_name_sorted.bam).
For each telomeric and non-telomeric read, the chromosome and band to which it was mapped are retrieved. Reads aligned to sequences other than autosomes or allosomes, or reads with a mapping quality lower than a specified threshold (default: 8) are considered unmapped. This information is used to generate a text file containing a table with the total number of reads mapped to each chromosome band and the total number of unmapped reads (*_readcount.tsv).
During the filtering step, the GC content of each read is determined if the amount of Ns in the read sequence is less than or equal to 20%. A table with the total number of reads for each possible GC content is generated (*_gc_content.tsv).
Because the filtering of telomere reads is the most time-consuming step of TelomereHunter, the user is asked whether he wants to run this step again if the output files already exist. If the answer is no, the filtering step will be skipped and TelomereHunter will start with the sorting step. Alternatively, the user can set the -nf/--noFiltering option. In this case, the filtering step is only run if the output files don't already exist. Setting this option is recommended when automatically (re-)submitting TelomereHunter jobs, because it does not require a user input.
Sorting of telomere reads
The extracted telomere reads from the filtering step are sorted into four different fractions depending on their mapping position. If the input is paired-end sequencing data and both mates have been extracted as telomere reads, the mapping position of the mate is also taken into account for the sorting.
If both mates are considered unmapped, i.e. they have a mapping quality below the specified threshold, they are sorted into the intratelomeric fraction. The junction spanning group comprises pairs in which one mate is unmapped and the other is mapped to the first or last band of a chromosome. Subtelomeric reads are those in which both mates are aligned to a first or last chromosome band. Reads are defined as intrachromosomal if at least one mate is mapped to a band that is not the first or last band of a chromosome.
Single-end reads or paired-end reads whose mates are not considered to be telomeric are sorted into the intratelomeric, subtelomeric or intrachromosomal fraction depending only on their mapping position.
For each of the four sorting fractions, a BAM file containing the telomere reads of the group is generated (*_filtered_intratelomeric.bam, *_filtered_junctionspanning.bam, *_filtered_subtelomeric.bam, *_filtered_intrachromosomal.bam). Furthermore, the number of telomere reads in each chromosome band and the number of junction spanning telomere reads at each chromosome end can be obtained from an output table called *.spectrum. The occurrences of the searched telomere repeat types in the telomere reads of every band are also presented in this table.
Estimating telomere content
To account for different library sizes and GC biases, the telomere content of a sample is estimated from the number of intratelomeric reads normalized by the total number of reads with a GC composition similar to that of telomeres. Because the GC content of the generic t-type repeat is 50%, the default GC content used for the normalization is 48-52%. However, the user can also define other limits for the GC correction using the options -gc1/--lowerGC and -gc2/--upperGC.
The results of the telomere content estimation can be found in the output *_summary.tsv file. The telomere content estimated by TelomereHunter is the number of intratelomeric reads per million reads with telomeric gc content:
tel_content = intratel_reads * 1,000,000 / total_reads_with_tel_gc
Screening for telomere variant repeats
Telomere variant repeats (TVRs) of the type NNNGGG (and the reverse complement) are searched in the intratelomeric reads, where 'N' can stand for A, C, G or T. The TVR is only counted if all six bases have a base quality of at least 20. For each sample, the raw TVR counts, the frequency relative to the total number of NNNGGG repeats and the average base quality at each position are saved in TVRs/*_TVRs.txt.
Different normalization results are summarized in the *_normalized_TVR_counts.tsv table. Here, the raw counts of each pattern in the tumor ("T") and control sample ("C) were divided by the total number of reads (column "Count_norm_by_all_reads_*") and the total number of intratelomeric reads ("Count_norm_by_intratel_reads*"). In the column "Count_per_100_bp_intratel_read_*", the raw count is normalized to an intratelomeric read length of 100 bp, which is calculated as follows:
Count_per_100_bp_intratel_read = TVR_count * 100 / summed_intratelomeric_read_lengths
The respective tumor/control log2 ratios are given in the columns "log2_ratio_count_norm_by_intratel_reads" and "log2_ratio_count_per_100_bp_intratel_read". In the columns singleton_count_*_norm, the raw counts are normalized to the total number of reads in the sample. The tumor/control log2 ratios are given in the columns "singleton_count_log2_ratio" and "singleton_count_log2_ratio_norm". In the column "distance_to_expected_singleton_log2_ratio", the telomere content log2 ratio is subtracted from the normalized singleton log2 ratio. It gives an estimate whether the singleton counts is higher (positive value) or lower (negative value) than expected given a certain telomere content. Please be aware, that this may not always be accurate depending on the TVR.
Getting the sequence context of telomere variant repeats
For selected TVRs (default: TCAGGG TGAGGG TTGGGG TTCGGG TTTGGG ATAGGG CATGGG CTAGGG GTAGGG TAAGGG, can be set with the -rc/--repeatsContext parameter), the sequence context on either side of the TVR is analyzed. The number of base pairs to investigate can be set with the -bp/--bpContext parameter (default: 18 bp). For each TVR, three raw count tables are generated in the directory TVR_context for the X bp context before, after and on either side of the TVR (*_neighborhood_before.tsv, *_neighborhood_after.tsv, *neighborhood.tsv, respectively). Each of them gives the counts for different sequence contexts as well as the percentage. For further analyses, only the X bp context on either side of the TVR are used.
The top occurring contexts of the tumor and control samples are summarized in *_TVR_top_contexts.tsv.
Counts of singletons (TVRs surrounded by the canonical t-type telomere repeats, an example would be TTAGGGTTAGGGTTAGGG-TGAGGG-TTAGGGTTAGGGTTAGGG for the TVR TGAGGG and 18 bp context) are extracted in the *_singletons.tsv table. For each selected TVR (column "pattern"), the raw counts are given in the columns "singleton_count_tumor" and "singleton_count_control".
Visualization of TelomereHunter results
The results of TelomereHunter are summarized in several different diagrams:
- Bar plots for each chromosome showing the number of telomere reads mapped to each band normalized by band length in bases and the total number of reads in the sample:
telomere_reads_band * 1,000,000/band_length * 1,000,000,000/total number of reads
Junction spanning reads shown in these plots are only normalised by the total number of reads:
telomere_reads_junction * 1,000,000,000/total number of reads
- Bar plot summarizing the number of telomere reads (per million total reads) in each of the four telomere read fractions.
- Bar plot showing the gc corrected telomere content of the analyzed samples.
In all bar plots mentioned above, the relative occurrences of the searched telomere repeat types are represented as stacks. If the -prc/--plotRevCompl option is set, TelomereHunter distinguishes between forward and reverse repeats as seen in the BAM file. Please note that reads aligned to the reverse strand may already be reverse complemented by the alignment tool and can therefore lead to confounding results in the barplot.
- Diagram with the GC content distribution in all reads (top) and in intratelomeric reads (bottom). The GC bins used for the GC correction in the telomere content estimation are highlighted.
- Histograms of repeat frequencies per intratelomeric read in each sample. Above the histograms, the cumulative percentage of reads containing each possible number of telomere repeats is shown.
- Barplot showing TVR counts in intratelomeric reads. On the left, the TVR counts normalized to the total number of intratelomeric reads are shown. Only patterns with a normalized count larger than 0.01 in the tumor or in the control sample are shown. TTAGGG was excluded and is only shown in the small inset plot. If a tumor and a control sample were run, the tumor/control log2 ratios are displayed on the right. log2 ratios larger than 5 are set to 5 for the plot.
- Scatterplot showing the TVR counts in the tumor sample against those in the control sample. The axes are shown on log scale and counts of 0 are set to 0.000001. This plot is only generated if a tumor and a control sample were run
- Bar plots showing raw and normalized singleton counts. If a tumor and a control sample were run, the tumor/control log2 ratio of the normalized singleton counts as well as the distance to the expected singleton count log2 ratio is shown.
Using default settings, all of the diagrams are generated in pdf format. However, the user can also specify which specific diagrams to generate and whether these should be in png, svg and/or pdf format.
To generate the diagrams, TelomereHunter requires R with the following libraries: ggplot2, reshape2, gridExtra, RColorBrewer. If it is not possible or you do not wish to install this on your computer, please use the option -p8/--plotNone.
Requirements
- operating system: Linux
TelomereHunter has been developed and tested using the following software versions. Should there be any complications with newer versions of the mentioned software (packages), please contact us.
- for telomere read extraction and calculation of telomere content
- python 2.7.9 (does not work for python 3!)
- pysam 0.9.0
- PyPDF2 1.26.0
- samtools 1.3.1
- python 2.7.9 (does not work for python 3!)
- for visualization
- R 3.3.0
- ggplot2 2.1.0
- reshape2 1.4.1
- gridExtra 2.2.1
- RColorBrewer 1.1-2
- cowplot 0.9.2
- svglite 1.2.1
- R 3.3.0
Contact Lina Sieverling (l.sieverling@dkfz-heidelberg.de) or Lars Feuerbach (l.feuerbach@dkfz.de) for questions and support on TelomereHunter.