TrieDedup: A fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

Lawrence Jianqiao Hu, Sai Luo, Ming Tian, Adam Yongxin Ye

February 2022

Abstract

High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may report bases with low quality, which leads to ambiguous bases, ‘N’s. PCR duplicates introduced during library preparation need to be removed in genomics studies, and several deduplication tools have been developed for this purpose. However, the existing tools cannot handle ‘N’s correctly or efficiently. Here we proposed and implemented TrieDedup, which uses a trie (prefix tree) structure to compare and store sequences. TrieDedup can handle ambiguous ‘N’ bases and efficiently deduplicate at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedListDict. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 160-fold faster than pairwise comparison, at a cost of 36-fold higher memory usage. With its ultra-fast algorithm that can account for ambiguous bases due to sequencing errors, TrieDedup may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets.
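To make the idea of trie-based deduplication with ambiguous bases concrete, here is a minimal Python sketch. It is not the TrieDedup implementation. The function names (`has_match`, `insert`, `dedup`), the `END` marker, the nested-dict trie, and the choice to process reads with fewer ‘N’s first are all assumptions made for illustration. Two reads are treated as duplicates when they have the same length and agree at every position where neither has an ‘N’.

```python
from typing import Dict, List

END = "$"  # hypothetical end-of-sequence marker, not part of the DNA alphabet


def has_match(root: Dict, seq: str) -> bool:
    """Return True if some stored sequence is compatible with seq ('N' matches any base)."""
    stack = [(root, 0)]  # iterative depth-first search over (trie node, position in seq)
    while stack:
        node, i = stack.pop()
        if i == len(seq):
            if END in node:  # a stored sequence ends exactly here: same length, compatible
                return True
            continue
        base = seq[i]
        if base == "N":
            # An ambiguous query base may follow every stored branch.
            stack.extend((child, i + 1) for key, child in node.items() if key != END)
        else:
            # A concrete base matches the same stored base or a stored 'N'.
            stack.extend((node[key], i + 1) for key in (base, "N") if key in node)
    return False


def insert(root: Dict, seq: str) -> None:
    """Store seq in the trie, creating nodes as needed."""
    node = root
    for base in seq:
        node = node.setdefault(base, {})
    node[END] = True


def dedup(reads: List[str]) -> List[str]:
    """Keep one representative per group of compatible reads."""
    root: Dict = {}
    unique: List[str] = []
    # Assumption of this sketch: reads with fewer 'N's are processed first,
    # so the least ambiguous read becomes the stored representative.
    for seq in sorted(reads, key=lambda s: s.count("N")):
        if not has_match(root, seq):
            insert(root, seq)
            unique.append(seq)
    return unique


print(dedup(["ACGT", "ACNT", "ACGT", "TTTT", "NNNN"]))
```

In this sketch each read is compared against the shared trie rather than against every previously kept read, which is the source of the speedup over pairwise comparison. Storing a node per base is also why a trie uses more memory.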

Type

Preprint

Lawrence Jianqiao Hu

Doctoral Candidate in Neuroscience
Ph.D. Ambassador, UW Medicine