I'm digesting the [1-maseq-intro notebook](https://github.com/Sage-Bionetworks/nf-hackathon-2019/blob/master/py_demos/1-rnaseq-intro.ipynb) and recording a few thoughts as I go along.
It's about [RNASeqs](https://en.wikipedia.org/wiki/RNA-Seq#Applications_to_medicine), which are good for
* **Disease classification.** Identify new disease biology
* **Clinical diagnosis.** Profile biomarkers for clinical indications
* **Drug discovery.** Infer druggable pathways, and make genetic diagnoses
We access the RNASeqs with this line in the notebook:
```
entity = syn.get('syn20449214')
```
This doesn't actually download the data into the docker instance or local machine running the notebook. It creates a [Synapse Schema reference](https://docs.synapse.org/articles/tables.html#defining-a-schema):
```
Schema(modifiedOn='2019-09-06T23:04:30.425Z', etag='47ac2028-d20c-4c1f-bc01-b640f6eec122', modifiedBy='1418096', concreteType='org.sagebionetworks.repo.model.table.TableEntity', columns_to_store=[], id='syn20449214', createdOn='2019-07-08T21:35:21.497Z', columnIds=['79686', '79687', '79757', '79689', '39358', '79690', '69934', '60946', '57594', '78515', '63725', '37198', '37293', '76638', '71826', '76961', '76934', '80366', '80367', '79660', '79696', '79861', '80455', '56628', '80456'], parentId='syn16941818', versionNumber=6, createdBy='1418096', name='Public Harmonized Gene Counts', versionLabel='6')
```
The notebook then queries the table entity, which takes 5 to 10 minutes on a 100MB internet connection.
```
results = syn.tableQuery("SELECT * FROM syn20449214")
genes_with_meta = results.asDataFrame()
```
During this time, I make some sourdough toast and top it with a mixture of [goat cheese, olive oil and peppers](https://www.facebook.com/bluegoatdairy), and freshen up my coffee.
${imageLink?synapseId=syn20718796&align=None&scale=40&responsive=true&altText=}
15GB of RAM is used. Computers with only 16GB of RAM will want to allocate a healthy amount of paging space, say 50GB. The downloaded dataframe has 6,861,460 rows and 25 columns, namely
```
Index(['index', 'totalCounts', 'Symbol', 'path', 'zScore', 'specimenID',
'parent', 'individualID', 'assay', 'dataType', 'sex', 'consortium',
'study', 'diagnosis', 'tumorType', 'species', 'fundingAgency',
'resourceType', 'nf1Genotype', 'nf2Genotype', 'studyName', 'id', 'age',
'isCellLine', 'experimentalCondition', 'transplantationType'],
dtype='object')
```
Two of these columns are dropped, index and path. Index is just an integer row ID. Path is some labelling from the dataset of origin which we probably don't need. That leaves us with 23 columns. To see all of the columns at once, get two monitors, and do this in the notebook:
```
from IPython.core.display import display, HTML
display(HTML(""))
```
The first thing I notice is that we don't see age marked in a lot of these records. This is a pity, because from a signalling point of view, where age in children correlates with changes in hormone expression, age can tell us a lot about cofactors for tumor expression. If I filter for age,
```
genes_with_meta_with_age=genes_with_meta.dropna(subset=['age'])
```
the number of records drops to 1,771,360 rows. I'm interested in which studies record age.
```
genes_with_meta_with_age.studyName.unique()
```
The one study that does is Columbia University NF1 Glioma Project. The studies that don't
```
[x for x in genes_with_meta.studyName.unique().tolist() if x != 'Columbia University NF1 Glioma Project']
```
are
* A Nerve Sheath Tumor Bank from Patients with NF1
* Cutaneous Neurofibroma Data Resource
* Preclinical NF1-MPNST Platform Development
* Synodos NF1 Preclinical Models (Minnesota CCHMC Recombinetics)
* Synodos NF2
* pNF Cell Line Characterization Studies
Getting age into the database should be a priority to model age-related hormone expression change cofactors.
The notebook shows how to plot a 2-category breakdown, by sex and tumor type. This is a really clever use of [Pandas](https://pandas.pydata.org/). Our Pandas skills are really going to shine at the end of this Hackathon!
Next the notebook plots gene expression. We pick a gene, for example SAMD4A, which [we can see from the COSMIC database as being relevant to MPNST](https://www.synapse.org/#!Synapse:syn18666641/discussion/threadId=6005):
```
def show_gene_expression(this_gene):
gene = genes_with_meta[genes_with_meta["Symbol"] == this_gene]
gene.boxplot(column='zScore', by= 'tumorType',
figsize =(8,8), fontsize=12, rot=70, grid=True)
plt.title(f"\n{this_gene}\nExpression across tumorTypes")
plt.ylabel('Expression (zScore)', fontsize = 12)
plt.tight_layout()
plt.savefig(f'1_rnaseq_intro_{this_gene}_expression.png')
show_gene_expression("SAMD4A")
```
We get a plot which shows some response above the mean for MPNST, but not nearly as much as it codes for Glioma:
${imageLink?synapseId=syn20720322&align=None&scale=100&responsive=true&altText=SAMD4A expression}
This gives us an idea: Let's run over all the human genes in this table and look for genes with highest expression for MPNST. There are 345,549 MPNST records. Let's get the average zScore per gene and look for the top 10 genes with highest average zScore. This gives 19,098 records.
```
MPNST=genes_with_meta[genes_with_meta["tumorType"] == 'Malignant Peripheral Nerve Sheath Tumor'].loc[:,['Symbol','zScore']]
MPNST_gene_average_zScore=MPNST.groupby('Symbol').mean()
MPNST_gene_average_zScore.nlargest(10,'zScore')
```
I think this is telling us that for MPNST we should be focusing on [COL1A1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4948583/), [EEF1A1](https://www.nature.com/articles/s41598-018-32272-x) and [AHNAK](https://www.nature.com/articles/s41598-018-32796-2). Let's look at the response of COL1A1 to MPNST versus other tumor types:
```
show_gene_expression("COL1A1")
```
We see that it has a strong association with MPNST, but an even stronger one to Meningioma:
${imageLink?synapseId=syn20720347&align=None&scale=100&responsive=true&altText=COL1A1 and MPNST}
So let's refine our goal. We don't want to just find the gene with the highest zScore for MPNST. We want to find a gene whose highest zScore is for MPNST, and for which that highest zScore is also among the highest for MPNST. The notebook goes on to explain that we should be doing PCA analysis to assess top contributors. So let's try that.
It starts by making a [pivot table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) where the rows are genes and the columns are specimen IDs and the cells are zScores. It drops any row for a gene which is not scored for all specimens. It then transposes the table into a matrix where the rows are specimens and the columns are genes. The specimen IDs contain the tumor type. The space of genes is then reduced to a smaller space of hypothetical proto-genes using PCA. The pivot table calc is kind of slow. The PCA is fast. The specimen labels are added to the rows of the PCA matrix. There are 293 specimens. The specimens are then plotted in PCA space. This gives a pretty but incomprehensible graph, with two clusters. Because we asked for two clusters. We are then invited to interpret this picture.
${imageLink?synapseId=syn20720356&align=None&scale=100&responsive=true&altText=Gene PCA}
I like the idea of PCA but:
* The proto-genes are hypothetical.
* The samples into the two clusters are well mixed.
* I can't make and show 3 or 4 clusters on the plane with this technique, it seems by construction to be glued to the dimension of the plane, which is 2.
* I don't know from looking at this what I should read for a single complication such as MPNST.
The next notebook, 2-drug-screening, has some more advanced clustering techniques to digest and then possibly bring back to the topic of MPNST in this notebook.