The Sanger Catalog of Somatic Mutations in Cancer (COSMIC) has 30,429 records for NF1. It is [downloadable in CSV format](https://cancer.sanger.ac.uk/cosmic/download) after filling out an account registration. Of the 30,429 records, 1,880 have both a confirmed NF1 mutation site and a confirmed primary histology.

**Question for Synapse admins:** Can you ask Sanger to allow you to host the whole Sanger database within Synapse, so we can use it as a source in a provenance graph?

The database contains columns for Mutation Coding DNA Sequence and Primary Histology. There are 35 identified primary histologies for the 1,880 records with mutation sites:

* adnexal_tumour
* adrenal_cortical_adenoma
* adrenal_cortical_carcinoma
* angiosarcoma
* carcinoid-endocrine_tumour
* carcinoma
* dermatofibrosarcoma_protuberans
* fibroepithelial_neoplasm
* fibrosarcoma
* ganglioneuroma
* gastrointestinal_stromal_tumour
* germ_cell_tumour
* glioma
* glomus_tumour
* haemangioblastoma
* haematopoietic_neoplasm
* leiomyosarcoma
* liposarcoma
* lymphoid_neoplasm
* malignant_fibrous_histiocytoma-pleomorphic_sarcoma
* malignant_melanoma
* malignant_peripheral_nerve_sheath_tumour
* mesothelioma
* neuroblastoma
* neurofibroma
* osteosarcoma
* paraganglioma
* pheochromocytoma
* rhabdomyosarcoma
* sex_cord-stromal_tumour
* solitary_fibrous_tumour
* thymic_carcinoma
* Wilms_tumour

We can then index the 1,880 mutation sites against the observed complications. For example, site c.7474C>T has two associated complications, fibroepithelial_neoplasm and carcinoma.

The following Python program takes the Sanger NF1 CSV file V89_38_TARGETEDSCREENMUTANT.csv and produces a picture with the above 35 complications indexed on the Y axis and the observed mutation sites on the X axis, visualizing which complications are associated with which mutations.
Here is the code:

```
import os, re
import pandas as pd
from matplotlib.pylab import *
%matplotlib inline

data_dir = os.getenv('NF2_DATA')
df = pd.read_csv(f'{data_dir}/V89_38_TARGETEDSCREENMUTANT.csv')

# Keep only the two columns we need (note the leading spaces in the
# original COSMIC headers) and drop incomplete records.
df = df.loc[:, [' PRIMARY_HISTOLOGY', ' MUTATION_CDS']].dropna()
df.columns = ['histology', 'mutation']

# Drop unknown mutations and uninformative histologies.
df = df[~df.mutation.str.contains(r"\?")]
df = df[df.histology != 'other']

# Reduce each CDS description (e.g. c.7474C>T) to its first coordinate.
df.mutation = df.mutation.apply(lambda x: int(re.search('[0-9]+', x).group()))

dd = df.to_dict()
histology_i = {hist: i for i, hist in
               enumerate(sorted(set(y for x, y in dd['histology'].items())))}
df.histology = df.histology.apply(lambda x: histology_i[x])

df.plot(x="mutation", y="histology", kind="scatter", figsize=(10, 8))
savefig('gwas.png')
```

and here is the resulting picture:

${imageLink?synapseId=syn20710314&align=None&scale=100&responsive=true&altText=Histology index on Y axis and Mutation site on X axis}

We see some interesting bands where some complications co-occur with many somatic mutations in neurofibromin 1. A bit more code

```
histo_count = {y: 0 for x, y in dd['histology'].items()}
for x, y in dd['histology'].items():
    histo_count[y] += 1
list(reversed(sorted([(y, x) for x, y in histo_count.items()])))[0:10]
```

tells us that the top 10 complications co-occurring with a somatic mutation of NF1 are:

```
[(803, 'carcinoma'),
 (267, 'haematopoietic_neoplasm'),
 (205, 'neurofibroma'),
 (200, 'malignant_melanoma'),
 (167, 'glioma'),
 (57, 'pheochromocytoma'),
 (34, 'lymphoid_neoplasm'),
 (20, 'malignant_peripheral_nerve_sheath_tumour'),
 (19, 'Wilms_tumour'),
 (15, 'germ_cell_tumour')]
```

Since this is a cancer database, some or all of the observed complications not usually associated with NF1 are most likely due to factors on other genes.
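As an aside, the histo_count loop above can be collapsed into a single pandas call. A minimal sketch with a made-up miniature of the dataframe (the real df is built by the script above):

```python
import pandas as pd

# Hypothetical miniature of the COSMIC slice; the real df is built above.
df = pd.DataFrame({
    "histology": ["carcinoma", "glioma", "carcinoma", "neurofibroma"],
    "mutation": [7474, 2033, 7474, 910],
})

# Count records per histology, most frequent first; this replaces
# the manual histo_count dictionary loop.
top = df["histology"].value_counts()
print(top.head(10))
```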
We now need to proceed as follows:

* Download the whole Sanger database (I just downloaded a slice, as a first cut), and then look for correlation of these top 10 complications with mutations on other genes, for the same patients.
* Conditional on a complication not having a strong, independent association with another gene, we can conclude that neurofibromin 1 plays a role.
* Where neurofibromin 1 plays a role, what are the common receptor sites or other factors which make these complications commonly susceptible to a mutation in neurofibromin 1?

**Question for Synapse admins:** What is the best practice to go from the CSV fished from an external site, to the Python program, to the graphical result, and store those in Synapse in a provenance-compliant way?

Created by Lars Ericson (lars.ericson)
58 of 59 mutation sites have a single histology, so the histology etiologies are largely independent.

```
L = nf1h.to_numpy().tolist()
histo_per_mutation = {x: set() for x, y in L}
for x, y in L:
    histo_per_mutation[x].add(y)
histo_per_mutation = [(x, len(histo_per_mutation[x])) for x in histo_per_mutation]
[x for x in histo_per_mutation if x[1] > 1]
```

That is, they do not share action of the mutated protein. Then the question becomes: for each histology, how do the mutated proteins of the different mutation sites associated with that histology act similarly to produce it?
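The same per-site check can be written as a pandas one-liner; a sketch using a made-up stand-in for the nf1h frame built below:

```python
import pandas as pd

# Stand-in for the nf1h (mutation, histology) frame; values are made up.
nf1h = pd.DataFrame({
    "mutation": [910, 910, 2033, 7474],
    "histology": [0, 1, 0, 2],
})

# Number of distinct histologies observed at each mutation site;
# sites with more than one would share action across etiologies.
shared = nf1h.groupby("mutation")["histology"].nunique()
multi = shared[shared > 1]
print(multi)
```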
We find 62 records in COSMIC of NF1 mutations associated with the above 5 complications.

```
nf1_histologies = [x[1] for x in gene_histologies if 'NF1' in x[0]]
inf1_histologies = [histology_i[x] for x in nf1_histologies]
inf1_histologies_j = {x: i for i, x in enumerate(inf1_histologies)}
j_inf1_histologies = inverse(inf1_histologies_j)
i_nf1_histology = {i: histologies[j_inf1_histologies[i]] for i in j_inf1_histologies}

nf1h = df[df.histology.isin(inf1_histologies)]
nf1h = nf1h[nf1h.gene == gene_i['NF1']]
nf1h = nf1h.loc[:, ['mutation', 'histology']]
nf1h.histology = nf1h.histology.apply(lambda x: inf1_histologies_j[x])

vc = nf1h.copy()
vc.histology = vc.histology.apply(lambda x: i_nf1_histology[x])
VC = vc['histology'].value_counts()
VC['TOTAL'] = VC.sum()
VC
```

giving

```
pheochromocytoma    30
neuroblastoma       13
rhabdomyosarcoma     7
neurofibroma         6
germ_cell_tumour     3
TOTAL               59
Name: histology, dtype: int64
```

We can plot the histologies against mutation sites to see if there is common action across histologies and get a visual idea of frequency:

```
i_nf1_histology[-1] = ''
i_nf1_histology[6] = ''
ax = nf1h.plot(x="mutation", y="histology", kind="scatter", figsize=(10, 8))
title('NF1-associated histologies NF1 somatic mutation sites',
      fontsize=12, fontweight="bold")
vals = ax.get_yticks()
ax.set_yticklabels([i_nf1_histology[x] for x in vals])
ylabel('HISTOLOGY', fontweight="bold")
xlabel('NF1 MUTATION SITE', fontweight="bold")
tight_layout()
```

giving

${imageLink?synapseId=syn20718121&align=None&scale=100&responsive=true&altText=}

To see which histologies occur most frequently, we can make a bar chart

```
matplotlib.rc('xtick', labelsize=12)
matplotlib.rc('ytick', labelsize=12)
figure(figsize=(8, 6))
ax = nf1h['histology'].value_counts(normalize=True).plot(kind='barh')
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.0%}'.format(x) for x in vals])
title('NF1-associated histologies frequency', fontsize=12, fontweight="bold")
vals = ax.get_yticks()
ax.set_yticklabels([i_nf1_histology[x] for x in vals])
ylabel('HISTOLOGY', fontweight="bold")
xlabel('% OCCURRENCE', fontweight="bold")
tight_layout()
```

giving

${imageLink?synapseId=syn20718120&align=None&scale=100&responsive=true&altText=}
The idea with COSMIC was to look at all the tumor types reported for patients with NF1. Then for those tumor types, across all patients with NF1, look at the gene most frequently associated with that tumor type. Then look only at the tumor types most frequently associated with NF1. Then look at how many NF1 mutations are associated with those tumor types. Then, for the tumor types that are associated with the largest number of NF1 mutations (there are about 2,400 observed mutations), ask, at a signalling level, what is in common for those tumors. We are looking for a novel, well-defined understanding of how NF1 acts to create tumors and what the full [signaling pathway](https://pdfs.semanticscholar.org/0b81/13019b86c3dff0b85f643577305a620ae598.pdf) is, and then search at a chemical level for the most likely agonists or antagonists. We did a first pass at the last part of the exercise above, but it confuses too many histologies with NF1. So this is our revised approach to avoid that. It will give us a more selective list of histologies, and then we can repeat the exercise above with those.
So here's what we do to get our more selective list:

```
import os, re
import pandas as pd
from matplotlib.pylab import *
from ipy_table import *
%matplotlib inline

data_dir = os.getenv('NF2_DATA')
df = pd.read_csv(f'{data_dir}/CosmicGenomeScreensMutantExport.tsv', sep='\t')
df = df.loc[:, ['Gene name', 'Mutation CDS', 'ID_sample', 'Primary histology']].dropna()
df.columns = ['gene', 'mutation', 'sampleid', 'histology']

# Drop unknown mutations and uninformative histologies.
df = df[~df.mutation.str.contains(r"\?")]
df = df[df.histology != 'other']
df.mutation = df.mutation.apply(lambda x: int(re.search('[0-9]+', x).group()))
df = df[df.histology != 'NS']

# Restrict to the histologies and samples in which NF1 mutations occur.
nf1 = df[df.gene == 'NF1']
dd = nf1.to_dict()
histologies = set(y for x, y in dd['histology'].items())
histology_i = {hist: i for i, hist in enumerate(sorted(histologies))}
samples = set(y for x, y in dd['sampleid'].items())
df = df[df.histology.isin(histologies)]
df = df[df.sampleid.isin(samples)]

bd = df.to_dict()
genes = set(y for x, y in bd['gene'].items())
gene_i = {gene: i for i, gene in enumerate(genes)}
n_genes = len(genes)
n_histologies = len(histologies)
df.gene = df.gene.apply(lambda x: gene_i[x])
df.histology = df.histology.apply(lambda x: histology_i[x])

# Count records per (gene, histology) pair and, for each histology,
# find the gene with the highest count.
gwas = pd.crosstab(df.gene, df.histology)
G = gwas.to_numpy()
hottest_gene_per_histology = np.argmax(G, axis=0)

def inverse(D):
    return {v: k for k, v in D.items()}

i_gene = inverse(gene_i)
i_histology = inverse(histology_i)
gene_per_hist = [i_gene[x] for x in hottest_gene_per_histology]
histologies = [i_histology[x] for x in range(len(i_histology))]
make_table([['Gene', 'Histology']] + sorted(list(zip(gene_per_hist, histologies))))
apply_theme('basic')
```

The resulting table is as follows:

${imageLink?synapseId=syn20717723&align=None&scale=60&responsive=true&altText=Most frequently cited gene for histology}

So now we have this smaller list of histologies for which NF1 seems likely to be causative:

* [germ_cell_tumour](https://en.wikipedia.org/wiki/Germ_cell_tumor), associated with the gonads
* [neuroblastoma](https://en.wikipedia.org/wiki/Neuroblastoma), associated with the adrenal gland
* [neurofibroma](https://en.wikipedia.org/wiki/Neurofibroma), associated with the myelin covering nerves
* [pheochromocytoma](https://en.wikipedia.org/wiki/Pheochromocytoma), associated with the adrenal gland
* [rhabdomyosarcoma](https://en.wikipedia.org/wiki/Rhabdomyosarcoma), associated with undeveloped striated muscle cells

The transition to MPNST seems more associated with SAMD4A than with NF1, which is interesting. So the next thing to ask is: for these 5 complications, how many NF1 mutation sites are associated with each complication?
One other item of note in terms of provenance and reproducibility: when I try to read a 10GB COSMIC file into a dataframe on my desktop PC with 32GB of RAM, it works. If I try it on my laptop with 16GB of RAM, it fails. So unless we are on a well-provisioned Google Cloud instance or a well-provisioned remotely accessed desktop, some of these calcs will not be practically reproducible. That's a minor side point, just a practical matter in provenance and reproducibility. It also tells us that for the Hackathon, the Google Cloud VMs that people spin up should be large, not small.
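A practical workaround on a smaller machine is to stream the file in chunks and keep only the needed columns, so peak memory is bounded by the chunk size rather than the file size. A sketch, with a tiny in-memory stand-in for the 10GB TSV (the real call would point at the file path; the chunk size is an assumption):

```python
import io
import pandas as pd

# Stand-in for the 10GB CosmicGenomeScreensMutantExport.tsv file.
tsv = io.StringIO(
    "Gene name\tMutation CDS\tID_sample\tPrimary histology\tExtra\n"
    "NF1\tc.7474C>T\t1\tcarcinoma\tx\n"
    "NF1\tc.2033dup\t2\tglioma\tx\n"
)

# Keep only the columns the analysis uses and stream in chunks, so peak
# memory is bounded by the chunk size rather than the full file size.
# For the real file, chunksize=100_000 or so would be a reasonable start.
cols = ["Gene name", "Mutation CDS", "ID_sample", "Primary histology"]
chunks = pd.read_csv(tsv, sep="\t", usecols=cols, chunksize=1)
df = pd.concat(chunk.dropna() for chunk in chunks)
print(df.shape)
```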
I'm back to thinking about the problem. I looked at [GWAS](https://en.wikipedia.org/wiki/Genome-wide_association_study) on Wikipedia and it seems convoluted. If I look at the whole database, there are 2,175 samples with NF1 somatic mutations, co-occurring with 20 complications. There are 56,048 genes listed, but on average only 5,290 genes have records for each NF1 sample. Put another way, each NF1 sample has on average 5,290 co-occurring mutations for 20 complications. The Wikipedia page describes a complicated approach to assessing whether a particular mutation leads to a particular complication, and it only yields what is considered to be a weak signal. I'm trying to do this in a simple way because I want a rough idea of which complications might share NF1 as a cause and which are probably due to a different gene. Because we really only care about the NF1 gene, we would only drill down on complications per mutation for the gene after we have ruled out complications that truly belong to other genes.

A simple approach is a matrix which starts at 0 for each gene-complication pair: rows are genes, columns are complications. For each tissue sample record for a gene-complication pair, increment the corresponding cell of the matrix by 1. Then for each complication, find the gene which has the highest count, and let that gene "own" the complication. Then, for the complications that NF1 owns, we drill down further to associate particular mutations on the gene with those complications, by repeating the process for gene-mutation-site pairs.
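The counting matrix described above is exactly a pandas crosstab. A minimal sketch with toy data (the gene and histology values are made up for illustration):

```python
import pandas as pd

# Toy sample records: one row per observed (gene, complication) pair.
df = pd.DataFrame({
    "gene": ["NF1", "NF1", "TP53", "NF1", "TP53"],
    "histology": ["neurofibroma", "carcinoma", "carcinoma",
                  "neurofibroma", "carcinoma"],
})

# Rows are genes, columns are complications, cells are record counts.
counts = pd.crosstab(df["gene"], df["histology"])

# Each complication is "owned" by the gene with the highest count.
owner = counts.idxmax(axis=0)
print(owner)
```

The same two lines, applied to gene-mutation-site pairs restricted to NF1, give the drill-down step.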
Hi Lars, To answer your question - no, there is no limit per user on Synapse. I'm sure an incredible amount will raise some eyebrows on our platform team, but I've uploaded several terabytes myself, so I wouldn't worry about it!

If you're looking to dockerize your analysis using large, restricted databases like COSMIC that don't have a simple credentialing system like Synapse's, you could consider (yet another) alternative approach: create an empty "data" folder in your container, and then mount your data to the container when you run it, using the `-v` flag in the `docker run` command, something like `docker run -p 8888:8888 -v $PWD/data:/home/jovyan/work/data nfosi/nf-hackathon-2019-py`, where you are mounting the "data" folder in your working directory to the "/home/jovyan/work/data" directory in the container. Then any code run in your container can simply source the data from the "data" folder. (The command on a Windows system might vary - I don't know about this.) This is a good way to distribute an analysis without distributing restricted data. The idea would be that anyone who wants to run your analysis has to have their own copy of COSMIC (or whatever database you want) locally, so they can retrieve it under whatever license they need, and then mount it into the container themselves when running it.

Since the datasets we are providing are not that large and can be easily accessed using Synapse credentials (they take a few minutes to download at most), creating credentials in the container is a nice simple solution. The other approach (mounting the data in) is a little more complicated and requires more instructions to get others using your container up and running, but could be worth it for large datasets with nebulous licensing. Just a thought!
I looked at it a little more. COSMIC has a [complicated scripted download protocol](https://cancer.sanger.ac.uk/cosmic/help/file_download). The full file is 2.4GB compressed and takes 10 minutes or so to download on high-speed internet. Uncompressed it is 10GB. This makes the idea of a URL link where the file is re-downloaded by a Docker running a particular analysis seem impractical. I guess I will upload the compressed file in private mode for use only with my team, if I can set the permissions that way. The Docker can uncompress it before use, as that doesn't take too long. I will work locally first to actually do my calc, and then do the dockerizing and private upload afterwards if I get useful results. This would be mostly to practice dockerizing things. I assume that you have fairly generous storage limits, because Docker files are large, and these databases are large, so it's fairly easy to rack up the gigabytes. Is there a storage limit per user on Synapse?
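One small simplification: pandas can decompress the download on the fly, so the container need not keep a 10GB uncompressed copy on disk. A sketch with a tiny in-memory gzip stand-in (the real file path and `.gz` extension are assumptions based on the COSMIC release naming):

```python
import gzip
import io
import pandas as pd

# Tiny gzipped TSV standing in for the 2.4GB compressed release file.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"gene\thistology\nNF1\tcarcinoma\n")
buf.seek(0)

# pandas decompresses on the fly; for a real on-disk file, something like
# pd.read_csv("CosmicGenomeScreensMutantExport.tsv.gz", sep="\t") works
# the same way (compression is inferred from the extension).
df = pd.read_csv(buf, sep="\t", compression="gzip")
print(df)
```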
I guess the thing to do is use the external URL solution. I was asking because I had to push buttons 3 times for the Hackathon to ask for access to external permissioned datasets. So what I was hoping you would do is call COSMIC and ask to make a cut of their whole database an external permissioned dataset with its own green button for the purposes of the Hackathon. Alternatively, what would be great is if you can point me to data within our permissioned data which does the same thing as the COSMIC database: a table with gene, mutation site, and tissue histology. A bonus column would be the age of the patient, so we can determine average age of onset. We also need not just NF1 mutations but all mutations, because as I observed above, everybody in the database having carcinoma could be selection bias (all the patients in COSMIC having carcinoma), rather than a consequence of having an NF1 mutation. So we need to see each patient's whole-genome mutations to factor out complications which are more attributable to another gene than to NF1.
I am not a lawyer, of course, so take this with a grain of salt, but I would assume that your usage falls under the academic use case (or more accurately, personal use, but I don't think that COSMIC explicitly addresses that). I would imagine that if you tried to commercialize something based on your findings, that would be the point at which you'd need to get a commercial license for these data. I see now that COSMIC also has periodic releases (currently v90), so if you are able to reference a specific release directly, that might be the most straightforward way of including it in provenance. Sorry for the confusion; it's been a while since I have used COSMIC. You can still upload the data to Synapse for your personal analysis use, but probably cannot make it widely downloadable, as that would be considered redistribution of the data under their license. Licensing of these things can be a headache, and is a reason I find noncommercial licenses to be a particular pain. When possible, I try to license as liberally as possible (for example, the content we produced for the hackathon we licensed [CC0](https://www.synapse.org/#!Synapse:syn18666641/wiki/595045)).
Thanks for the recommendation. Per accepted usage, I will upload the data file, put the above calcs into a Jupyter notebook, Dockerize it, and then store it in Synapse with a provenance chain. I don't have an academic email. I asked COSMIC for academic access. COSMIC's website says it will bill my company if it hears that I got any use out of the data. I put in my registration that it was for the Hackathon. Synapse is a non-profit. The only people seeing the data will be academics looking at it in context of provenance. So I think that's OK. I just didn't want to break any terms of service and get Synapse in trouble. Doing [hard science](https://en.wikipedia.org/wiki/Hard_and_soft_science) in bio-pharma is a novel experience for me, so I don't know what accepted usage is. ![Hard Science](http://cf.broadsheet.ie/wp-content/uploads/2016/03/tumblr_o3vdbmxLBx1rwkrdbo1_500.jpg)
Hi Lars: You have two options. One is to upload the original data to your project, creating a Synapse entity that you can reference in all of your analyses. This is a good solution because the COSMIC database regularly changes, so what another person gets in a year could look different than what you download today. The other option is to reference an external URL in the provenance, which works fine too, e.g.

```
data_file = File(path="random_numbers.txt", parent="syn1901847")
data_file = syn.store(data_file, executed="syn7205215",
                      used="http://mathworld.wolfram.com/NormalDistribution.html")
```

(example from https://docs.synapse.org/articles/provenance.html)

Note that in this second example, Synapse will not track changes in the 'used' URL, so this is generally a better solution when you control the external link as well (such as a GitHub-stored script) or can reference a specific version (e.g. ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_25/ or something).

Histologies co-occurring with somatic NF1 mutations, from the Sanger COSMIC database