Hi
I noticed that loading all the VCF files takes a long time. Considering the limited resources of the Docker machine, and the fact that everyone needs to load the VCF files, I think it would be helpful to also offer a table (CSV) version of all these variants merged together, including annotations. This would remove a lot of overhead on the Docker machine.
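Something like this is what I have in mind. Just a sketch, assuming each VCF has already been flattened to a per-sample CSV; the paths and the added `sample` column are placeholders:

```{r}
library(data.table)

# Hypothetical layout: one CSV per VCF, all in one directory.
csvs <- dir("/path/to/csvs", pattern = "\\.csv$", full.names = TRUE)

# Read each CSV, tag it with its sample name, and stack everything into one table.
merged <- rbindlist(
  lapply(csvs, function(f) {
    dt <- fread(f)
    dt[, sample := sub("\\.csv$", "", basename(f))]
    dt
  }),
  fill = TRUE  # tolerate annotation columns that differ between files
)

fwrite(merged, "all_variants_merged.csv")
```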
thanks
DA
Thanks. Sorry, I should explain what it does: it unzips one vcf.gz, filters for the mutations whose FILTER is PASS, writes out a .csv of those, and then zips the original file back. It loops over however many VCFs are in the directory.

```{r}
library(doParallel)  # also attaches foreach, so %dopar% is available
library(data.table)
library(tidyverse)
#install.packages("R.utils")
library(R.utils)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Keep only the passing alterations in each VCF.
path <- "/home/schtuff_file"
filenames <- dir(path, pattern = "\\.vcf\\.gz$", full.names = TRUE)

foreach(f = filenames, .packages = c("data.table", "R.utils", "tidyverse")) %dopar% {
  # Unzip one vcf.gz and read it line by line.
  vcf <- sub("\\.gz$", "", f)
  gunzip(f, destname = vcf, overwrite = TRUE)
  lines <- readLines(vcf)

  # Drop the ## meta-information lines; the #CHROM line carries the column names.
  header <- grep("^#CHROM", lines)[1]
  df <- str_split_fixed(lines[header:length(lines)], "\t", 11)
  colnames(df) <- df[1, ]
  df <- df[-1, , drop = FALSE]
  is.na(df) <- df == "."  # VCF uses "." for missing values

  # Keep only variants whose FILTER column is PASS, then drop FILTER.
  df <- as.data.frame(df, optional = TRUE) %>%  # optional = TRUE keeps the "#CHROM" name
    select(`#CHROM`, POS, ID, REF, ALT, FILTER) %>%
    filter(FILTER == "PASS") %>%
    select(-FILTER)

  # Write the .csv next to the original, then zip the original file back.
  outfile <- paste0(sub("\\.vcf\\.gz$", "", f), "_pass.csv")
  fwrite(df, file = outfile, sep = ",")
  gzip(vcf, overwrite = TRUE)
}
stopCluster(cl)
```
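Side note: I believe fread() can read .vcf.gz directly when R.utils is installed, and its skip argument accepts a search string, so the unzip/re-zip round trip could probably be dropped entirely. A minimal sketch (the file name is just an example):

```{r}
library(data.table)  # fread handles .gz itself when R.utils is available

# skip = "#CHROM" starts reading at the first line containing that string,
# so the ## meta-information lines are skipped automatically.
dt <- fread("sample1.vcf.gz", skip = "#CHROM")
passed <- dt[FILTER == "PASS", .(`#CHROM`, POS, ID, REF, ALT)]
fwrite(passed, "sample1_pass.csv")
```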