Dear AMP-AD administrators,

We recently developed a machine learning framework that evaluates potential associations between gene sets and AD: https://www.biorxiv.org/content/10.1101/2020.05.15.098749v1

In brief, the framework evaluates an input gene set for its ability to predict the Braak stage in AMP-AD datasets. The gene set of interest is then compared against a background distribution of randomly-selected gene sets to establish statistical significance. In the paper, we used the framework to evaluate gene sets arising from drug perturbations, thus generating hypotheses about drug repurposing. However, the framework is general enough to allow the evaluation of ANY gene set.

To that end, we would like to put together an R/Shiny web app, where users can submit their custom gene sets and get a score for how predictive of Braak stage that gene set is. To streamline the process, it would be helpful to have AMP-AD datasets cached on the back end, thus allowing for a quick application of our machine learning methods.

My question is whether this is allowed by the Data Usage Agreements. Note that the end users of our app will never have access to raw AMP-AD data through our tool. They will only ever see a score of how good their gene set is at recognizing the Braak stage in these data.

Thank you,
Artem Sokolov, PhD
Director of Informatics and Modeling
Laboratory of Systems Pharmacology
Harvard Medical School
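For readers unfamiliar with the approach, the comparison against a background of random gene sets described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the preprint: the scorer (AUC of mean expression for a binary early/late label), the synthetic data, and every function name here are hypothetical stand-ins for whatever predictor and Braak encoding the real framework uses.

```python
import random


def auc(scores, labels):
    """Probability that a random late-stage sample outranks a random early-stage one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gene_set_score(expr, labels, genes):
    """Stand-in predictor (assumption): rank samples by mean expression over the set."""
    per_sample = [sum(sample[g] for g in genes) / len(genes) for sample in expr]
    a = auc(per_sample, labels)
    return max(a, 1.0 - a)  # treat up- and down-regulation alike


def empirical_p_value(expr, labels, genes, all_genes, n_background=200, seed=0):
    """Compare a gene set's score with scores of random gene sets of the same size."""
    rng = random.Random(seed)
    observed = gene_set_score(expr, labels, genes)
    background = [
        gene_set_score(expr, labels, rng.sample(all_genes, len(genes)))
        for _ in range(n_background)
    ]
    hits = sum(b >= observed for b in background)
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_background + 1)


def make_toy_data(n_samples=40, n_genes=50, n_signal=5, seed=1):
    """Synthetic expression data (not AMP-AD): the first n_signal genes track the label."""
    rng = random.Random(seed)
    all_genes = [f"g{i}" for i in range(n_genes)]
    labels = [i % 2 for i in range(n_samples)]  # 0 = early stage, 1 = late stage
    expr = []
    for y in labels:
        sample = {g: rng.gauss(0.0, 1.0) for g in all_genes}
        for g in all_genes[:n_signal]:
            sample[g] += 2.0 * y
        expr.append(sample)
    return expr, labels, all_genes


if __name__ == "__main__":
    expr, labels, all_genes = make_toy_data()
    score, p = empirical_p_value(expr, labels, all_genes[:5], all_genes)
    print(f"score={score:.3f}, empirical p={p:.4f}")
```

The point of the background comparison is that a high raw score alone says little if arbitrary gene sets of the same size score just as well on a given dataset, so only the score relative to the random sets is reported.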

Created by Artem Sokolov ArtemSokolov
Hi Mette,

In our paper, we actually kept the three datasets separate. We even further subdivided MSBB by the four regions, because we were concerned about the impact of dataset-specific and region-specific effects on the predictor accuracy. Our Supplementary Figure 1 (page 38 in https://www.biorxiv.org/content/10.1101/2020.05.15.098749v1.full.pdf) shows that there is some difference in the amount of signal present in each dataset, when it comes to predicting disease severity. In particular, we noticed that it was remarkably easy to identify late-stage (Braak 5, 6) samples in the Mayo dataset with any arbitrary set of genes. I think this raises some concerns about possible batch effects.

So, in general, I would be a little hesitant to aggregate everything into a single common dataset. However, if the samples are properly annotated with the original dataset / region, the aggregated dataset can always be split back up by users, should they detect any potential batch effects. With that said, it's always a good idea to process everything with a uniform pipeline, even if you don't end up aggregating all samples together.

I may be able to join your call, depending on when it is.

Best,
-Artem
We are in the process of applying a uniform pipeline to the ROSMAP (including the recent data releases), MSBB, and MayoRNAseq data. That will include normalized counts. This will generate a much larger dataset. Are you interested in a call to discuss how we may make this work?
Hi @Mette,

For the moment, we are looking to cache the gene expression data (e.g., syn3505720 for ROSMAP) and the Braak stage scores (e.g., syn3191087). We discussed generalizing our framework to other data types, such as Mass Spec proteomics, but this is not yet implemented. To reiterate, the end user will never access the cached data directly.

We will be more than happy to contribute our R/Shiny app to the computational tools portal.

Thank you,
-Artem
This sounds like a very interesting app. What assays and level of data are you looking for? If developed, would you be willing to share it through the [AD portal computational tools](https://adknowledgeportal.synapse.org/Explore/Computational%20Tools)?

Using AMP-AD data in a shiny app backend