Does anyone have tips for combining cell lines? Currently, I am treating each binding site from each cell line as an i.i.d. sample, which I have been told is suboptimal. Ideally, I would weight each cell line. Does anyone know how substantial the gain in performance is from intelligently combining cell lines?
Created by Daniel Quang (daquang)

Got it. Thanks for the advice!

Dear Rajiv,
My opinion is based on two observations.
First, the cell-type similarities are not consistent when comparing DNase-accessibility profiles, overall gene expression or expression of transcription factors only. So, the general 'similarity' is not trivial to define based on a single feature.
Second, enrichment of different features in 'B' regions is not consistent between different transcription factors.
Thus, a proper similarity metric between cell types might be TF-specific (see e.g. LR+ plots of different features for ATF2 binding in different cell types in our write-up;
although this particular example might be somewhat affected by 'bad' data, since ATF2 was withdrawn from phase 2).
No, we haven't tested the 'DNase similarity only' approach (a leaderboard submission might not be strictly necessary, since there are multiple and quite different cell types in training).

Ivan, you mention in your write-up that solely using DNase similarity or gene expression to determine which cell lines to train on wouldn't work as well. Can you elaborate a little more on why you think genome-wide DNase correlation specifically wouldn't be a good metric to use? For example, do you have any evidence supporting Daniel's example: "We could have a scenario where cell line D shares a more common gene expression pattern and Factor Y binding with cell line C, but shares more common Factor X binding sites with cell line A." Or, if you tried leaderboard submissions while training on the top 1-2 cell lines just based on DNase similarity, did that not work as well as training on your curated cell lines?

Indeed, the cell type similarity is TF-dependent.
We haven't found a robust way to estimate the similarities quantitatively (which is necessary for properly weighted aggregation over the cell types),
and used only 'the most similar' cell types for training.
This might be a limitation of our 'naive' prediction method, so (I believe) there should be a better strategy.

Thank you for your suggestions. At first, I wanted to weight contributions from each cell line according to DNase or gene expression similarities to the testing cell line, but then I realized I would run into the issue that I would be using the same weights for each factor. For example, suppose Factors X and Y both have testing cell line D and training cell lines A, B, and C. If, based on gene expression, I get the weights 1, 2, and 3 for the three training cell lines, respectively, I would have to use these same weights for both Factors X and Y. We could have a scenario where cell line D shares a more common gene expression pattern and Factor Y binding with cell line C, but shares more common Factor X binding sites with cell line A. I believe autosome.ru mentioned something along these lines before. I wonder if there's a smart way to weight each cell line differently for each factor.

Hi Daniel,
We (autosome.ru) were selecting a limited set (1-2) of the most similar cell types (based on predicted feature enrichment).
This is by no means a perfect approach, and proper merging of cell types should be more stable.
Personally I think both Guan's lab "average over cell types" and David's (davidaknowles) "PCA on gene expression" are quite reasonable,
but could be improved by taking into account DNase similarity or by inspecting more deeply how different features contribute to the predictions.
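
As a rough illustration of the DNase-similarity idea discussed in this thread, the sketch below computes the Pearson correlation between the test cell line's DNase signal and each training cell line's signal, either genome-wide or restricted to a subset of bins. The function name `dnase_similarity`, the binned-signal inputs, the toy numbers, and the optional `mask` (for instance, bins considered relevant to one factor, which is just one hypothetical way to make the metric factor-specific) are all assumptions made for illustration, not any team's actual method.

```python
import numpy as np

def dnase_similarity(train_signal, test_signal, mask=None):
    """Pearson correlation of each training cell line's DNase signal with the test cell line.

    train_signal: dict mapping cell line name -> 1-D array of binned DNase signal
    test_signal:  1-D array for the test cell line over the same bins
    mask:         optional boolean array selecting bins (hypothetical: bins
                  relevant to a single factor)
    """
    test = np.asarray(test_signal, dtype=float)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        test = test[mask]
    sims = {}
    for name, signal in train_signal.items():
        x = np.asarray(signal, dtype=float)
        if mask is not None:
            x = x[mask]
        sims[name] = float(np.corrcoef(x, test)[0, 1])
    return sims

# Toy, made-up signal over six bins for training cell lines A and C and test cell line D.
train = {
    "A": np.array([5.0, 1.0, 0.0, 4.0, 2.0, 0.0]),
    "C": np.array([0.0, 1.0, 6.0, 0.0, 2.0, 5.0]),
}
test_d = np.array([4.0, 1.0, 5.0, 3.0, 2.0, 4.0])

genome_wide = dnase_similarity(train, test_d)

# Hypothetical set of bins relevant to Factor X.
factor_x_bins = np.array([True, True, False, True, True, False])
factor_x = dnase_similarity(train, test_d, mask=factor_x_bins)
```

In this toy data, C is the more similar cell line genome-wide while A is the more similar one on the masked bins, which mirrors the scenario Daniel describes: a single genome-wide ranking can disagree with a factor-specific one.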
Best,
Ivan

This is the main trick that the top-scoring team in Round 1 (autosome.ru) uses. They use several tricks to match and weight training and test cell types. You would have to ask them for more details.
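
To make the two strategies from this thread concrete (weighting every training cell line versus keeping only the 1-2 most similar ones), here is a minimal sketch that turns per-factor similarity scores into per-sample training weights. Everything in it is hypothetical: the helper names `cell_line_weights` and `sample_weights`, the similarity values, and the assumption that similarities are non-negative. It does not reproduce autosome.ru's actual procedure, and it says nothing about how the similarities themselves should be estimated.

```python
import numpy as np

def cell_line_weights(similarity, top_k=None):
    """Turn similarity scores for ONE factor into weights that sum to 1.

    similarity: dict mapping training cell line -> non-negative similarity
                to the test cell line (assumed to be given)
    top_k:      if set, keep only the top_k most similar cell lines;
                all other cell lines get weight 0
    """
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    kept = ranked if top_k is None else ranked[:top_k]
    total = sum(similarity[name] for name in kept)
    if total <= 0:
        raise ValueError("kept similarities must sum to a positive value")
    return {name: (similarity[name] / total if name in kept else 0.0)
            for name in similarity}

def sample_weights(sample_cell_lines, weights):
    """Expand per-cell-line weights to one weight per training sample."""
    return np.array([weights[c] for c in sample_cell_lines], dtype=float)

# Hypothetical similarities of training cell lines A, B, C to test cell line D,
# estimated separately for each factor.
similarity_by_factor = {
    "FactorX": {"A": 3.0, "B": 2.0, "C": 1.0},
    "FactorY": {"A": 1.0, "B": 2.0, "C": 3.0},
}

# Cell line of origin for each training sample (toy example with four samples).
sample_cell_lines = ["A", "A", "B", "C"]

# Factor X: weight all cell lines. Factor Y: keep only the two most similar.
wx = sample_weights(sample_cell_lines,
                    cell_line_weights(similarity_by_factor["FactorX"]))
wy = sample_weights(sample_cell_lines,
                    cell_line_weights(similarity_by_factor["FactorY"], top_k=2))
```

Arrays like `wx` and `wy` could then be passed to any learner that accepts per-sample weights (many scikit-learn estimators take a `sample_weight` argument in `fit`), so each factor's model sees the same pooled samples but with its own cell-line weighting.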