Hi,
What is the best way to get the length of each gene in ROSMAP_all_counts_matrix.txt.gz (syn8691134)?
Thanks
Kevin
Created by Kevin Hu kevin.hu

Hi Kevin,
It looks like Gencode v24 was used to process the BAM files into the count matrix, according to syn9757881.
You can download the gzipped GTF file here:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz
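The R code below extracts the gene lengths from the unzipped GTF file; just replace the path in the `import.gff` call with the path to your GTF. Here a gene's length is the total width of its exons after overlapping exons have been merged.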
```
library(GenomicRanges)
library(rtracklayer)
library(genoset)
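# Import only the exon records; replace the path with the location of your unzipped GTF
GTF <- import.gff("/gencode.v24.annotation.gtf", format="gtf", genome="GRCh38.p5", feature.type="exon")
# Group exons by gene and merge overlapping exons so each base is counted once
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=TRUE)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)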
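# Gene length = summed width of a gene's merged exons
calc_length <- function(x) {
  sum(elementMetadata(x)$widths)
}
# sapply() returns a named vector; as.matrix() makes it a one-column matrix with one row per gene
output <- as.matrix(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")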
```
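To make it clearer what "gene length" means in the code above, here is a minimal base-R sketch of the same idea on a toy exon table. The data and the `union_length` helper are made up for illustration and are not part of GenomicRanges. It merges overlapping exons within a gene and sums the merged widths, which is roughly what `reduce()` followed by summing `width()` does.

```r
# Toy exon table (hypothetical data): one row per exon,
# with 1-based inclusive coordinates as in a GTF file.
exons <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneB"),
  start   = c(100, 150, 400, 10),
  end     = c(200, 250, 450, 59)
)

# Hypothetical helper: total number of bases covered by a set of
# closed intervals [start, end], counting overlapping bases once.
union_length <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or adjacent exon: extend the current merged block
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and start a new one
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Named numeric vector with one merged-exon length per gene
gene_lengths <- sapply(split(exons, exons$gene_id),
                       function(g) union_length(g$start, g$end))
```

Here `gene_lengths` is 202 for geneA (exons 100-200 and 150-250 merge into 100-250, which is 151 bases, plus 51 bases for 400-450) and 50 for geneB. On the real annotation, `sum(width(grl))` applied to the reduced `grl` object should give the same per-gene totals in one vectorized step, if the per-gene `sapply()` turns out to be slow.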
Best,
Jake