Mapping sequenced reads to a reference genome is often the first step in analyzing next-generation sequencing data. However, a genome can contain many regions that closely resemble one another, and reads derived from such regions are difficult to map back because it is unclear which copy they came from. Knowing where these regions are lets one treat them with care and makes the downstream analysis easier to interpret.
In fact, the UCSC Genome Browser provides such a resource: Mappability and Uniqueness of genomes.
There are three types of data here, which I summarize in the table below:
Type | Generation Method |
---|---|
Alignability | Using GEM-mappability and allowing up to 2 mismatches, the uniqueness of each k-mer is scored as S = 1/(number of mapped places). |
Uniqueness | Same as Alignability, but no mismatches are allowed. |
Blacklisted | 229 manually curated regions that attract many mapped reads regardless of tissue, defined using chromatin accessibility and ChIP-seq data. |
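
To make the scoring formula concrete, here is a toy sketch of S = 1/(number of mapped places) on a short string. It is a brute-force illustration, not how GEM-mappability works internally: it scans only the forward strand, and the function names `hamming` and `kmer_scores` and the `max_mismatches` parameter are made up for this example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_scores(genome: str, k: int, max_mismatches: int = 0) -> list[float]:
    """Return S = 1 / (number of mapped places) for every k-mer start.

    Assumption for illustration: a "mapped place" is any start position
    whose k-mer is within `max_mismatches` substitutions of the query.
    Only the forward strand is considered.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    scores = []
    for query in kmers:
        # Every k-mer matches itself, so hits is always at least 1.
        hits = sum(hamming(query, target) <= max_mismatches for target in kmers)
        scores.append(1.0 / hits)
    return scores


toy = "ACGTACGTTT"
print(kmer_scores(toy, k=4))                    # no mismatches, like 'Uniqueness'
print(kmer_scores(toy, k=4, max_mismatches=2))  # up to 2 mismatches, like 'Alignability'
```

Allowing mismatches can only add mapped places, so in this sketch the mismatch-tolerant score at a position is never higher than the exact-match score.
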
Key notes
- The ‘Blacklisted’ regions exclude genic and promoter regions and overlap little with low-alignability regions, so using both lists in an analysis is recommended.
- The ‘Blacklisted’ regions are based on chromatin accessibility and ChIP-seq data, so they may not be effective for RNA-seq data.
- The scores (ranging from 0 to 1) for ‘Alignability’ and ‘Uniqueness’ are assigned to the first base of each k-mer, not the middle one.
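
The first-base convention matters when turning scores into per-base masks: a score stored at position i describes the whole k-mer spanning bases i to i+k-1. Below is a minimal sketch of that bookkeeping. It assumes the scores are already loaded into a plain Python list with one entry per k-mer start; the function name `uniquely_covered_bases` is hypothetical and not part of any UCSC tool.

```python
def uniquely_covered_bases(first_base_scores: list[float], k: int) -> list[bool]:
    """Mark each base that lies inside at least one k-mer with score 1.

    Assumption for illustration: first_base_scores[i] is the score of the
    k-mer that STARTS at base i, so it applies to bases i .. i + k - 1.
    """
    n_bases = len(first_base_scores) + k - 1
    covered = [False] * n_bases
    for start, score in enumerate(first_base_scores):
        if score == 1.0:
            for pos in range(start, start + k):
                covered[pos] = True
    return covered


# Three 2-mers over four bases; only the 2-mer starting at base 1 is unique.
print(uniquely_covered_bases([0.5, 1.0, 0.5], k=2))
```

By the same logic, the score for a read of length k that starts at position p is the value stored at p itself, not at p + k/2.
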
References
- GEM-mappability: the program used to generate the alignability data.
- This link describes how the blacklisted regions were generated, and this link leads to the download site for the blacklisted regions.
- The data for human hg19 can be downloaded here.
Last modified on 2018-08-31