Digesting the 3-wgs-intro notebook

[This notebook](https://github.com/Sage-Bionetworks/nf-hackathon-2019/blob/master/py_demos/3-wgs-intro.ipynb) works with the [whole genome sequencing dataset](https://ghr.nlm.nih.gov/primer/testing/sequencing). It accesses the genomic variant data collected from normal and tumor samples of two different tumor types under two different study initiatives. The two tumor types and one control type are: * Cutaneous Neurofibroma * Low Grade Glioma * Normal The data has the [following fields:](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/#vep) * Tumor_Sample_Barcode * Hugo_Symbol. names of genes according to HUGO database * Entrez_Gene_Id. Gene ID according to Entrez Database * Center * NCBI_Build. Reference genome that was used to align the exomeSeq data * Chromosome. Chromosome number (range 1-22 and X,Y), Chr M == mitochondrial genome (absent in exomes with NCBI Build == hg19) * Start_Position * End_Position * Strand * Variant_Classification. See below * Variant_Type * Reference_Allele * Tumor_Seq_Allele1 * Tumor_Seq_Allele2 * dbSNP_RS * dbSNP_Val_Status * Matched_Norm_Sample_Barcode * Match_Norm_Seq_Allele1 * Match_Norm_Seq_Allele2 * Tumor_Validation_Allele1 * Tumor_Validation_Allele2 * Match_Norm_Validation_Allele1 * Match_Norm_Validation_Allele2 * Verification_Status * Validation_Status * Mutation_Status * Sequencing_Phase * Sequence_Source * Validation_Method * Score * BAM_File * Sequencer * Tumor_Sample_UUID * Matched_Norm_Sample_UUID * HGVSc * HGVSp * HGVSp_Short * Transcript_ID * Exon_Number * t_depth * t_ref_count * t_alt_count * n_depth * n_ref_count * n_alt_count * all_effects * Allele * Gene * Feature * Feature_type * Consequence * cDNA_position * CDS_position * Protein_position * Amino_acids * Codons * Existing_variation * ALLELE_NUM * DISTANCE * STRAND_VEP * SYMBOL * SYMBOL_SOURCE * HGNC_ID * BIOTYPE * CANONICAL * CCDS * ENSP * SWISSPROT * TREMBL * UNIPARC * RefSeq * SIFT * PolyPhen * EXON * INTRON * DOMAINS * AF * AFR_AF * AMR_AF * ASN_AF * EAS_AF * EUR_AF * SAS_AF * AA_AF * EA_AF * CLIN_SIG * SOMATIC * PUBMED * MOTIF_NAME * MOTIF_POS * HIGH_INF_POS * MOTIF_SCORE_CHANGE * IMPACT ??? * PICK * VARIANT_CLASS * TSL * HGVS_OFFSET * PHENO * MINIMISED * ExAC_AF * ExAC_AF_AFR * ExAC_AF_AMR * ExAC_AF_EAS * ExAC_AF_FIN * ExAC_AF_NFE * ExAC_AF_OTH * ExAC_AF_SAS * GENE_PHENO * FILTER * flanking_bps * vcf_id * vcf_qual * ExAC_AF_Adj * ExAC_AC_AN_Adj * ExAC_AC_AN * ExAC_AC_AN_AFR * ExAC_AC_AN_AMR * ExAC_AC_AN_EAS * ExAC_AC_AN_FIN * ExAC_AC_AN_NFE * ExAC_AC_AN_OTH * ExAC_AC_AN_SAS * ExAC_FILTER * gnomAD_AF * gnomAD_AFR_AF * gnomAD_AMR_AF * gnomAD_ASJ_AF * gnomAD_EAS_AF * gnomAD_FIN_AF * gnomAD_NFE_AF * gnomAD_OTH_AF * gnomAD_SAS_AF * vcf_pos * id. Synapse ID of the sample (unique for each sample) * age. The age of the patient * assay * diagnosis * individualID * nf1Genotype * nf2Genotype * organ * isCellLine. Indicates whether the origin tissue was a cell line or a patient * sex. The sex of the patient * species. The source of the specimen * specimenID * study. The specific initiative/consortia that the study was a part of * studyId * disease * tumorType. The type of tumor, can be one of 7 different diagnoses The variant classifications are as follows: Variant_classification | Description --- | --- _Nonsense-Mutation_ | Mutation leading to change of a coding codon to stop codon _Splice-Site_ | Mutation leading to change in splice site _Missense-Mutation_ | Mutation resulting in change in amino acid _In-Frame-Del_ | Deletion of nucleotides divisible by three leading to deletions of amino acids _In-Frame-Ins_ | Insertion of nucleotides divisible by three leading to insertion of amino acids _Frame-Shift-Ins_ | Insertions of nucleotides (not divisible by three) such that codons downstream of the insertion are shifted resulting in a malformed protein or nonsense-mediated decay _Frame-Shift-Del_ | Deletions of nucleotides (not divisible by three) such that codons downstream of the deletion are shifted resulting in a malformed protein or nonsense-mediated decay _Translation-Start-Site_ | Mutation causing changes in translation start site _Nonstop-Mutation_ | SNP in stop codon that disrupts the stop codon causing continued translation _IGR_ | Mutations in intergenic regions There are 711,036 records. The notebook tabulates the records by sex and tumor type as follows: ``` genes_with_meta_table = pd.crosstab(index=genes_with_meta["sex"], columns=genes_with_meta["tumorType"]) ``` The resulting counts are as follows. We clearly have a lot more cutaneous neurofibroma to deal with: ${synapsetable?query=select %2A from syn20727529&showquery=true} Also good is to know how to express this as a percentage of total records: ``` gmtpct=genes_with_meta_table/genes_with_meta.shape[0] gmtpct=gmtpct.style.format({x:"{:.2%}" for x in gmtpct.columns}) ``` However when we try to store this in Synapse ``` table = build_table('Whole genome sequence sample distribution percentage', project, gmtpct) syn.store(table) ``` we get an error: ``` ValueError: Values of type is not yet supported. ``` So we try a different way around, which works: ``` gmtpct=genes_with_meta_table/genes_with_meta.shape[0] for x in gmtpct.columns: gmtpct[x]=gmtpct[x].map(lambda n: '{:,.2%}'.format(n)) ``` and the result is: ${synapsetable?query=select %2A from syn20727530&showquery=true} So we see that, all things considered, we have a tiny amount of low grade glioma data, and are perhaps more likely to succeed by focusing on cutaneous NF. This is also apparent in the bar chart produced by the notebook, where low grade glioma is barely visible as a bar: ${imageLink?synapseId=syn20727531&align=None&scale=100&responsive=true&altText=} Next the notebook builds a visualization of "high impact" mutations. The "IMPACT" column in our data is thus intriguing. It is not an ENSEMBL-defined column. It takes on 3 values: * MODERATE * HIGH * MODIFIER There is also an undescribed FILTER column. It takes on one of 3 values: * common_variant * . * PASS The code to visualize high-impact mutations filters in HIGH and filters out common_variant, leave "." and "PASS" mutations. The resulting dataframe has 18,129 rows. It is then grouped by gene name to get the total number of high-impact mutations per experiment per gene. This is then pivoted to get a matrix with genes as columns and experiments as rows. Where a gene does not occur in an experiment, it is marked with 0, otherwise 1. So, a binary matrix. I think the experiments each represent individual tissue samples. So what we are seeing is co-occurence of mutations for a tissue sample with a given tumor type. There are 60 samples and 1150 genes. The exercise then seems to be filtering for the first 100 gene names in alphabetical order, to make a picture and get the sense of that picture. It plots the tissue samples on the X axis and the 100 selected genes on the Y axis. The color of the cell then signals how many mutations have occurred for that gene and tissue sample. We save the resulting plot to Synapse: ``` fn='3_wgs_intro_high_impact_map.png' from bokeh.io import export_png export_png(p, filename=fn) test_entity = File(fn, description='Number of high impact mutations per gene', parent=project) test_entity = syn.store(test_entity) ``` The result looks like this: ${imageLink?synapseId=syn20727551&align=None&scale=100&responsive=true&altText=} This has between 1 and 6 sub-clouds of dots, depending on how much you want to aggregate. There is one big cloud and a lot of little clouds. The desirable outcome is a medium number of medium-sized clouds. As commented in the notebook, the high impact map is not actuallythat insightful, and it would have to be 11.5 times longer to cover all of the genes that are affected. The next go-to move is dimensionality reduction with PCA. We collapse the space of 1150 genes into 2 proto-genes and look at the resulting cloud of dots: ${imageLink?synapseId=syn20727552&align=None&scale=100&responsive=true&altText=} This picture is pretty but also insight-free. The notebook then introduces a new scoring function, not just the total number of mutations but something which incorporates the impact of those mutations based on the mysterious IMPACT column. The weighted mutation score is defined as ``` score = genes.totHits * genes.impactScore *(genes.isBenign + genes.isDamaging ``` This results in another multicolored cloud of dots, but with between 4 and 9 medium-sized subclouds: ${imageLink?synapseId=syn20727553&align=None&scale=100&responsive=true&altText=} These analyses are a bit frustrating as we don't seem to be narrowing down to specific action items in terms of drugs to try or gene therapy interventions to try.

Created by Lars Ericson lars.ericson
Thanks, that's very helpful. It turns out there are a lot of articles that directly address PCA, clustering, neural net and network techniques for drug discovery. This seems to be a full specialty on it's own: #### PCA * [Identification of candidate drugs using tensor-decomposition-based unsupervised feature extraction in integrated analysis of gene expression between diseases and DrugMatrix datasets](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5653784/) * [Drug candidate identification based on gene expression of treated cells using tensor decomposition-based unsupervised feature extraction for large-scale data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2395-8) * [Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0574-4) * [Principal Component Analysis as a Tool for Library Design: A Case Study Investigating Natural Products, Brand-Name Drugs, Natural Product-Like Libraries, and Drug-Like Libraries](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4373534/) * [A New Drug Combinatory Effect Prediction Algorithm on the Cancer Cell Based on Gene Expression and Dose?Response Curve](https://ascpt.onlinelibrary.wiley.com/doi/10.1002/psp4.9) uses PCA to remove outliers #### Clustering * [A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2123-4) * [A Comparative Study of Cluster Detection Algorithms in Protein?Protein Interaction for Drug Target Discovery and Drug Repurposing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6389713/) * [iDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting](https://arxiv.org/pdf/1707.00994.pdf) * [An integrative machine learning approach for prediction of toxicity-related drug safety](https://www.biorxiv.org/content/10.1101/455667v1.full) * [Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0060618)
Hi Lars, Absolutely agree with you that a bit of warm up with the data is highly beneficial. This was one of our motivations for preparing the data introductory notebooks and reports instead of just giving access to the raw data. We understand your struggle with the PCA and clustering of the data. In case of biological data, dimension reductionality techniques are generally used to find any oddities/batch effects in the data. Ideally, one would like the data points to nicely separate into tumor vs normal clusters but that is almost never the case! In the RNA-Seq data, since the PCA shows two clusters in the data, one would want to explore what is behind the separation (different tumor-types?, different studies?, different model systems?, activation of different cellular signaling pathways? etc). Genomic variant data is even more heterogeneous and thus even harder to interpret based on PCA. In including these plots in the data introductory documents, our goal was to help participants get started in thinking about the data. At the same time, we are providing some of the code necessary for accessing and playing with the data for anyone who is less familiar with these kind of datasets. Since PCA or clustering is generally one of the exploratory steps in genomic data or drug data analysis, it is seldom reported as a figure in papers highlighting intricacies of cellular signaling causing disease. But, in case you are interested in biological case studies with positive outcomes with PCA or UMAP clustering, please refer to the following 1) https://www.math.umd.edu/~petersd/666/html/iris_pca.html 2) http://resources.qiagenbioinformatics.com/manuals/advancedrnaseq/current/index.php?manual=Principal_component_analysis_plot_2D.html 3) https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
Hi @jineta.banerjee, I'm not trying to be critical. I am new to this field and I find it helps me to get through the material if I talk aloud as I'm doing so. Also I'm trying to plough the field a little bit to the extent that I can, to help other people get up to speed faster. That said it seems like not that many people are trying to get a running start on the weekend. I see one other person on the discussion board so far. For me, there's quite a lot to master, in terms of what Synapse offers, what DT Explorer offers, what other databases are out there, and so on. I guess that most of the other people coming into this will be in the biotech/biopharma field already, and so already conversant in things that are totally new to me. Also the SV.AI Slack space has no activity so far. I guess the party won't really start until it actually starts on Friday night. For an intractable, incurable disease, though, it seems like at least a few days of warm-up are advisable, prior to hacking a cure over the weekend. Also, on the third hand, the application of PCA and clustering in the notebooks seems largely inconclusive to me. I think it would be very helpful to have a worked practical example of PCA and clustering in this domain where the results are positive. For example, in Machine Learning, everybody starts and ends their education with [Fashion MNIST](https://www.pyimagesearch.com/2019/02/11/fashion-mnist-with-keras-and-deep-learning/), which is a successful, graspable, visualizable and intuitive application of machine learning. I think it would be very helpful to cook up a Notebook 5 that covers a successful prior polypharmacological application of PCA, and a Notebook 6 covering a successful polypharm application of UMap and Clustering. It doesn't have to be an NF1 example, just any example of these techniques that was used in practice, say in some published paper, to produce an obvious and clear insight that was a concrete step in discovering a specific drug. So, the Fashion MNIST of polypharm PCA and the Fashion MNIST of polypharm UMAP/Clustering. Thanks, Lars
Hi @lars.ericson It is great to see you engage with and digest the introductory documents for the various datasets and facilitate conversation around them. I would like to highlight the fact that these documents are designed to introduce the quirks of the datasets to the hackathon participants so that the data is a bit more meaningful to them. These are by no means exhaustive analyses that can narrow down specific action items for the participants. Our goal is to give complete freedom of thought and creativity to the participants to come up with their own avenues of analyses. Having said that, at the hackathon (on Friday) we will have experts from the NF field in attendance, and they will probably share a few pointers for interesting avenues for analyses. Hopefully those pointers will help you narrow down your creative analytical ideas.

Your web browser must have JavaScript enabled in order for this application to display correctly.
If you are an automated web crawler from a search engine, follow this AJAX application crawl link

Drop files to upload

Digesting the 3-wgs-intro notebook page is loading…