The [last of the four intro notebooks](https://github.com/Sage-Bionetworks/nf-hackathon-2019/blob/master/py_demos/4-exomeseq-intro.ipynb) looks at whole exome sequence data. The dataset has the following columns:
* **Tumor_Sample_Barcode**.
* **Hugo_Symbol**. names of genes according to HUGO database
* **Entrez_Gene_Id**. Gene ID according to Entrez Database
* **Center**.
* **NCBI_Build**. Reference genome that was used to align the exomeSeq data
* **Chromosome**. Chromosome number (range 1-22 and X,Y), Chr M == mitochondrial genome (absent in exomes with NCBI Build == hg19)
* **Start_Position**.
* **End_Position**.
* **Strand**.
* **Variant_Classification**. See below.
* **Variant_Type**.
* **Reference_Allele**.
* **Tumor_Seq_Allele1**.
* **Tumor_Seq_Allele2**.
* **dbSNP_RS**.
* **dbSNP_Val_Status**.
* **Matched_Norm_Sample_Barcode**.
* **Match_Norm_Seq_Allele1**.
* **Match_Norm_Seq_Allele2**.
* **Tumor_Validation_Allele1**.
* **Tumor_Validation_Allele2**.
* **Match_Norm_Validation_Allele1**.
* **Match_Norm_Validation_Allele2**.
* **Verification_Status**.
* **Validation_Status**.
* **Mutation_Status**.
* **Sequencing_Phase**.
* **Sequence_Source**.
* **Validation_Method**.
* **Score**.
* **BAM_File**.
* **Sequencer**.
* **Tumor_Sample_UUID**.
* **Matched_Norm_Sample_UUID**.
* **HGVSc**.
* **HGVSp**.
* **HGVSp_Short**.
* **Transcript_ID**.
* **Exon_Number**.
* **t_depth**.
* **t_ref_count**.
* **t_alt_count**.
* **n_depth**.
* **n_ref_count**.
* **n_alt_count**.
* **all_effects**.
* **Allele**.
* **Gene**.
* **Feature**.
* **Feature_type**.
* **Consequence**.
* **cDNA_position**.
* **CDS_position**.
* **Protein_position**.
* **Amino_acids**.
* **Codons**.
* **Existing_variation**.
* **ALLELE_NUM**.
* **DISTANCE**.
* **STRAND_VEP**.
* **SYMBOL**.
* **SYMBOL_SOURCE**.
* **HGNC_ID**.
* **BIOTYPE**.
* **CANONICAL**.
* **CCDS**.
* **ENSP**.
* **SWISSPROT**.
* **TREMBL**.
* **UNIPARC**.
* **RefSeq**.
* **SIFT**.
* **PolyPhen**.
* **EXON**.
* **INTRON**.
* **DOMAINS**.
* **AF**.
* **AFR_AF**.
* **AMR_AF**.
* **ASN_AF**.
* **EAS_AF**.
* **EUR_AF**.
* **SAS_AF**.
* **AA_AF**.
* **EA_AF**.
* **CLIN_SIG**.
* **SOMATIC**.
* **PUBMED**.
* **MOTIF_NAME**.
* **MOTIF_POS**.
* **HIGH_INF_POS**.
* **MOTIF_SCORE_CHANGE**.
* **IMPACT**.
* **PICK**.
* **VARIANT_CLASS**.
* **TSL**.
* **HGVS_OFFSET**.
* **PHENO**.
* **MINIMISED**.
* **ExAC_AF**.
* **ExAC_AF_AFR**.
* **ExAC_AF_AMR**.
* **ExAC_AF_EAS**.
* **ExAC_AF_FIN**.
* **ExAC_AF_NFE**.
* **ExAC_AF_OTH**.
* **ExAC_AF_SAS**.
* **GENE_PHENO**.
* **FILTER**.
* **flanking_bps**.
* **vcf_id**.
* **vcf_qual**.
* **ExAC_AF_Adj**.
* **ExAC_AC_AN_Adj**.
* **ExAC_AC_AN**.
* **ExAC_AC_AN_AFR**.
* **ExAC_AC_AN_AMR**.
* **ExAC_AC_AN_EAS**.
* **ExAC_AC_AN_FIN**.
* **ExAC_AC_AN_NFE**.
* **ExAC_AC_AN_OTH**.
* **ExAC_AC_AN_SAS**.
* **ExAC_FILTER**.
* **gnomAD_AF**.
* **gnomAD_AFR_AF**.
* **gnomAD_AMR_AF**.
* **gnomAD_ASJ_AF**.
* **gnomAD_EAS_AF**.
* **gnomAD_FIN_AF**.
* **gnomAD_NFE_AF**.
* **gnomAD_OTH_AF**.
* **gnomAD_SAS_AF**.
* **vcf_pos**.
* **id**. Synapse ID of the sample (unique for each sample)
* **parentId**.
* **benefactorId**.
* **projectId**.
* **age**. the age of the patient
* **assay**.
* **diagnosis**.
* **individualID**.
* **nf1Genotype**.
* **nf2Genotype**.
* **organ**.
* **isCellLine**. indicates whether the origin tissue was a cell line or a patient
* **sex**. the sex of the patient
* **species**. the source of the specimen
* **specimenID**.
* **study**. the specific initiative/consortia that the study was a part of
* **studyId**.
* **disease**.
* **tumorType**. the type of tumor, can be one of 7 different diagnoses
The [ENSEMBL Variant Classifications](https://uswest.ensembl.org/info/genome/variation/prediction/classification.html#classes) are
**Variant_Classification** | **Description**
--- | ---
`Nonsense_Mutation` | Mutation that changes a coding codon into a stop codon
`Splice_Site` | Mutation that changes a splice site
`Missense_Mutation` | Mutation that changes an amino acid
`In_Frame_Del` | Deletion of a number of nucleotides divisible by three, leading to deletion of amino acids
`In_Frame_Ins` | Insertion of a number of nucleotides divisible by three, leading to insertion of amino acids
`Frame_Shift_Ins` | Insertion of a number of nucleotides not divisible by three, so that codons downstream of the insertion are shifted, resulting in a malformed protein or nonsense-mediated decay
`Frame_Shift_Del` | Deletion of a number of nucleotides not divisible by three, so that codons downstream of the deletion are shifted, resulting in a malformed protein or nonsense-mediated decay
`Translation_Start_Site` | Mutation that changes the translation start site
`Nonstop_Mutation` | SNP in a stop codon that disrupts it, causing continued translation
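
The "divisible by three" rule in the table can be made concrete with a small sketch. The function below is hypothetical (it is not part of the notebook) and assumes that the variant lies entirely in coding sequence and that an empty allele is written as `-`. It returns the classification names with underscores.

```python
def classify_indel(reference_allele: str, tumor_allele: str) -> str:
    """Classify a coding insertion or deletion by its effect on the reading frame.

    Hypothetical helper for illustration. "-" stands for an empty allele.
    """
    ref_len = len(reference_allele.replace("-", ""))
    alt_len = len(tumor_allele.replace("-", ""))
    net_change = alt_len - ref_len  # > 0 is an insertion, < 0 a deletion
    if net_change == 0:
        raise ValueError("not an insertion or deletion")
    kind = "Ins" if net_change > 0 else "Del"
    # A net change that is a multiple of three keeps downstream codons intact.
    frame = "In_Frame" if net_change % 3 == 0 else "Frame_Shift"
    return f"{frame}_{kind}"


print(classify_indel("-", "GCA"))  # In_Frame_Ins
print(classify_indel("TT", "-"))   # Frame_Shift_Del
```

Real annotation tools also take the transcript and the position within the codon into account, so this only illustrates the length rule.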
The tumorType column may be unpopulated (NaN), or one of the following types:
* Normal
* Neurofibroma
* Plexiform Neurofibroma
* Malignant Peripheral Nerve Sheath Tumor
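
To see how the tumor types are distributed, including the unpopulated entries, a count along the following lines is enough. This is a minimal sketch on made-up values, not the notebook's code. The helper name `count_tumor_types` and the label `"unpopulated"` are assumptions for illustration.

```python
import math
from collections import Counter


def count_tumor_types(values):
    """Count tumor types, grouping missing values (None or NaN) together."""
    def label(value):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return "unpopulated"
        return value

    return Counter(label(v) for v in values)


# Toy values, not taken from the dataset.
toy = ["Neurofibroma", float("nan"), "Normal", "Neurofibroma", None]
counts = count_tumor_types(toy)
print(counts)
```

If the data is already in a pandas DataFrame, `df["tumorType"].value_counts(dropna=False)` should give the same kind of summary.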
The distribution of tumor types is more uniform than in the whole genome dataset of the last notebook. The exome case is also more interesting because it is the "compiled" form of DNA, which is smaller and presumably more tractable than the whole genome, about 90% of which is material that gets elided in transcription. So for purposes of drug discovery I guess we can forget about whole genome in favor of whole exome. To that extent, the unbalanced and small set of tumor types in the whole genome data is not a problem:
${imageLink?synapseId=syn20727563&align=None&scale=100&responsive=true&altText=}
The notebook then looks for a correlation between variant classification and the protein products of the genes. It looks at high-impact variants in the alphabetically first 100 genes in the dataset. The notebook does not describe how the values in the IMPACT column were assessed. The result is a matrix plot similar to the one in the 3rd notebook, where the columns are samples and the rows are genes.
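
The selection and the genes-by-samples matrix can be sketched as follows. This is not the notebook's code. It assumes that high-impact rows are those with `IMPACT == "HIGH"`, that `specimenID` identifies a sample, and that the first 100 genes are chosen before filtering on impact. The function name `high_impact_matrix` is made up for illustration.

```python
from collections import defaultdict


def high_impact_matrix(variants, n_genes=100, sample_key="specimenID"):
    """Build a genes x samples presence matrix of HIGH-impact variants.

    variants: iterable of dicts with "Hugo_Symbol", "IMPACT" and sample_key.
    Returns (genes, samples, matrix); matrix[i][j] is 1 if gene i has at
    least one HIGH-impact variant in sample j, else 0.
    """
    variants = list(variants)
    # Alphabetically first n_genes gene symbols present in the data.
    genes = sorted({v["Hugo_Symbol"] for v in variants})[:n_genes]
    samples = sorted({v[sample_key] for v in variants})
    hits = defaultdict(set)  # gene symbol -> samples with a HIGH-impact variant
    for v in variants:
        if v["IMPACT"] == "HIGH":
            hits[v["Hugo_Symbol"]].add(v[sample_key])
    matrix = [[1 if s in hits[g] else 0 for s in samples] for g in genes]
    return genes, samples, matrix


# Toy rows, not taken from the dataset.
toy = [
    {"Hugo_Symbol": "NF1", "IMPACT": "HIGH", "specimenID": "s1"},
    {"Hugo_Symbol": "NF1", "IMPACT": "MODERATE", "specimenID": "s2"},
    {"Hugo_Symbol": "A1BG", "IMPACT": "HIGH", "specimenID": "s2"},
]
genes, samples, matrix = high_impact_matrix(toy)
print(genes, samples, matrix)
```

Rows follow `genes` and columns follow `samples`, which matches the orientation of the plot described above.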
When we zoom in and look at the differential impact of different variant classifications, we see that missense mutations and in-frame deletions are less destructive than the 4 other observed mutation types. That accounts for 6 of the 9 classifications. The other 3 are left out because the following variant types do not occur in observed NF1 mutations:
* In_Frame_Ins
* Translation_Start_Site
* Nonstop_Mutation
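
A check like this one shows which classifications never occur for a gene. It is a sketch on toy rows, not the notebook's code. The helper `missing_classifications` is hypothetical, and the nine names are the classifications from the table, written with underscores.

```python
ALL_CLASSIFICATIONS = {
    "Nonsense_Mutation", "Splice_Site", "Missense_Mutation",
    "In_Frame_Del", "In_Frame_Ins", "Frame_Shift_Ins",
    "Frame_Shift_Del", "Translation_Start_Site", "Nonstop_Mutation",
}


def missing_classifications(variants, gene="NF1"):
    """Return the classifications that never occur for the given gene."""
    observed = {
        v["Variant_Classification"]
        for v in variants
        if v["Hugo_Symbol"] == gene
    }
    return sorted(ALL_CLASSIFICATIONS - observed)


# Toy rows covering the six classifications reported as observed.
toy = [
    {"Hugo_Symbol": "NF1", "Variant_Classification": c}
    for c in (
        "Nonsense_Mutation", "Splice_Site", "Missense_Mutation",
        "In_Frame_Del", "Frame_Shift_Ins", "Frame_Shift_Del",
    )
]
print(missing_classifications(toy))
```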
The notebook concludes by making a matrix of tissue samples versus mutational events. 5 event types are compressed down to 2 proto event types. The result is another cloud of colored dots. We cannot come to any conclusions from it, except to observe that it shows an outlier, which could be bad data.