Hello,

My team and I have been attempting to benchmark our algorithm, but we seem to have run into an interesting conundrum when scoring triplets correct. It seems to primarily arise when there are multiple samples with the same label. For example (i.e. sub1 train 6), where the samples are:

c1 : 2100111000
c2 : 2100101000
c3 : 2100101000
c4 : 2100101000

The reference tree is:

```
          /---------- c1
/---------+
|         \---------- c2
+
|         /---------- c3
\---------+
          \---------- c4
```

Our reconstructed tree is:

```
          /---------- c2
          |
/---------+---------- c3
|         |
+         \---------- c4
|
\-------------------- c1
```

And as a result, the triplets correct score is ~25%. I suppose this brings up two questions:

1) How are these trees w/ duplicate samples going to be dealt with when evaluating our results via TC/RF?
2) Do you make assumptions about tree format (i.e. binary only) that may cause our results to significantly tank if we don't abide by them?

Thank you.

Best Regards,
Alex

Created by Alex Khodaverdian alexkhodaverdian
Hi Alex,

Sorry for pitching in late. I would like to expand on this a bit and clarify the triplets scoring in the case of your example. Your reconstruction is correct with respect to the mutation data, yet it is penalized. While this is frustrating, there are likely many cases in all subchallenges where the reference trees capture information not encoded in the mutation signatures. We could try to mitigate obvious cases like the one in this example, but more complex cases would remain hidden.

Diving into the implementation details of the triplet scoring: in this case it's easy to list all possible triplets and consider the status of each. First I'll just clarify the trees in the example as I understood them:

```
import dendropy

# Reference and reconstructed trees from the example, in Newick format
ref = dendropy.Tree.get_from_string('((c1,c2),(c3,c4));', schema='newick')
rec = dendropy.Tree.get_from_string('(c1,(c2,c3,c4));', schema='newick')

print(ref.as_ascii_plot(width=30))
print(rec.as_ascii_plot(width=30))
```

```
             /------------- c1
/------------+
|            \------------- c2
+
|            /------------- c3
\------------+
             \------------- c4


/-------------------------- c1
|
+            /------------- c2
|            |
\------------+------------- c3
             |
             \------------- c4
```

| triplet | status |
|---|---|
| c1 c2 c3 | Incorrect |
| c1 c2 c4 | Incorrect |
| c1 c3 c4 | Correct |
| c2 c3 c4 | Ambiguous |

The score here is as good as that of a random tree, which is fine since you can't really do better with the provided mutation data.

TreeCmp captures this in the triplets score (a score of 0 indicates matching trees while a score of 1 indicates a random tree; scores greater than 1, i.e. worse than random, would not be additionally penalized compared to 1.0):

| metric | value |
|---|---|
| Common_taxa | 4 |
| R-F | 0.5 |
| R-F_toYuleAvg | 0.7092 |
| R-F_toUnifAvg | 0.7257 |
| Triples | 3 |
| Triples_toYuleAvg | 1.1136 |
| Triples_toUnifAvg | 1.0929 |

This normalization can be reproduced by considering the number of all possible triplets, with a 1/3 chance of a random triplet being correct:

```
import scipy.special

triples = 3  # triplet distance reported by TreeCmp
n_taxa = 4   # number of leaves shared by both trees

# A random tree is expected to get 2/3 of all triplets wrong
triples_score = 3*triples/(2*scipy.special.comb(n_taxa, 3, repetition=False))
print(triples_score)  # 1.125
```

Best,
Ofir
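The status of the four triplets listed above can be checked with a small pure-Python sketch. It does not use dendropy or TreeCmp: the nested-tuple tree representation and the helper names `leaves`, `resolve` and `classify` are assumptions made for illustration only. Counting an ambiguous triplet as a difference is also an assumption, although it agrees with the `Triples 3` value reported above.

```python
from itertools import combinations
from math import comb

# Hypothetical representation: a tree is a nested tuple of leaf names.
REF = (("c1", "c2"), ("c3", "c4"))
REC = ("c1", ("c2", "c3", "c4"))


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def resolve(tree, triplet):
    """Return the pair of the triplet that the rooted tree groups together,
    or None when the tree leaves the triplet unresolved."""
    node = tree
    while True:
        # Keep only the children that contain at least one triplet member
        groups = [(child, leaves(child) & triplet) for child in node]
        groups = [(child, members) for child, members in groups if members]
        if len(groups) == 1:
            node = groups[0][0]  # all three members sit in one child: descend
        elif len(groups) == 2:
            return frozenset(max((m for _, m in groups), key=len))
        else:
            return None  # three members in three children: a polytomy


def classify(ref, rec):
    """Label every triplet as Correct, Incorrect or Ambiguous."""
    status = {}
    for trip in combinations(sorted(leaves(ref)), 3):
        in_ref, in_rec = resolve(ref, set(trip)), resolve(rec, set(trip))
        if in_ref is None or in_rec is None:
            status[trip] = "Ambiguous"
        elif in_ref == in_rec:
            status[trip] = "Correct"
        else:
            status[trip] = "Incorrect"
    return status


status = classify(REF, REC)
for trip, label in status.items():
    print(" ".join(trip), label)

# Assumption: every triplet that is not Correct counts towards the distance.
distance = sum(label != "Correct" for label in status.values())
normalized = 3 * distance / (2 * comb(len(leaves(REF)), 3))
print(distance, normalized)
```

With the two example trees this gives a distance of 3 and a normalized value of 1.125, the same number as the scipy snippet above.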
Hi Alex, indeed there is no way to distinguish between them. What I meant is that the training set gives you examples of how you can get to twin cells, and that this information might be useful to reconstruct a tree. But you are right: in the end, identical cells are just that. As Alejandro indicated, and depending on the results, we will take this into account. Thanks, P
Hey Pablo, Thanks for following up. Just to clarify, if we have two cells with the exact same character string, by definition there is no way to distinguish between them, and thus there is no way to prefer where to place one over the other. Are we missing something that would be able to break the arbitrary nature of where to place these cells? Best Regards, Alex
Dear Alex, one more thought: twin signatures for cells are, as Alejandro indicated, a scoring problem, but given the training data it is also part of the problem to understand how cells reach the same signature and maybe discover an underlying pattern/mechanism. P
Hi Alex, In this dataset, there are some cells with identical labels, which means that they inherited the same barcode and received no additional edits before the experiment stopped. When scoring the trees we will calculate the RF against the ground truth, also considering the duplicated labels. This means that perfect reconstruction is impossible, since there is an ambiguity that can't possibly be resolved from the barcodes alone. So for submission, just send your reconstruction including the duplicates. Best
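To see which cells are affected before submitting, one option is to group cells by barcode. The sketch below uses the four barcodes from the example in this thread; the dictionary layout and the helper name `duplicate_groups` are assumptions for illustration and are not part of the challenge tooling.

```python
from collections import defaultdict

# Barcodes from the sub1 train 6 example quoted in this thread
barcodes = {
    "c1": "2100111000",
    "c2": "2100101000",
    "c3": "2100101000",
    "c4": "2100101000",
}


def duplicate_groups(barcodes):
    """Map each barcode shared by more than one cell to the cells carrying it."""
    groups = defaultdict(list)
    for cell, code in barcodes.items():
        groups[code].append(cell)
    return {code: cells for code, cells in groups.items() if len(cells) > 1}


dups = duplicate_groups(barcodes)
for code, cells in dups.items():
    print(code, cells)
```

Here c2, c3 and c4 end up in one group, so any ordering among them in a reconstructed tree is arbitrary as far as the barcodes are concerned.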
Hi Alex, sorry, I meant that the scoring will be handled the same way for TC in the case of non-binary trees. Cheers, Pablo
Hello, Thank you for getting back to me regarding this concern. I believe TC is affected by non-binary trees. For example: in the first tree, c3 and c4 are more closely related to each other than to c2, whereas in the second tree, c2, c3 and c4 are all equally related. Therefore the triplets result would be different. Am I missing something here? Best Regards, Alex
Hi Alex, thanks for your interest in the challenge. Your two questions both relate to which tree format is accepted: non-binary trees are indeed accepted and will be scored using TC/RF. TC is not affected by non-binary trees. For RF, the score is also well defined in the case of non-binary trees once we add dendropy's reconstructed_tree.suppress_unifurcations(), followed by rewriting of the reconstructed tree, to our standard scoring pipeline. It could introduce some ambiguity in some scores, but this will be the same for every participant so it should not be a problem. Best
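As a rough illustration of why RF stays well defined for non-binary trees, here is a pure-Python sketch that collapses single-child nodes and then compares non-trivial bipartitions. It is not the challenge's scoring pipeline and does not call dendropy: the nested-tuple representation and the helpers `suppress_unifurcations` (named after the dendropy method but written here from scratch), `splits` and `rf_distance` are assumptions for illustration.

```python
def suppress_unifurcations(node):
    """Collapse internal nodes that have exactly one child."""
    if isinstance(node, str):
        return node
    children = tuple(suppress_unifurcations(child) for child in node)
    return children[0] if len(children) == 1 else children


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial unrooted bipartitions, each stored as the side that
    does not contain a fixed anchor leaf."""
    all_leaves = frozenset(leaves(tree))
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = frozenset(leaves(node))
        if anchor in side:
            side = all_leaves - side
        # Keep a split only if both sides hold at least two leaves
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Size of the symmetric difference between the two split sets."""
    return len(splits(tree_a) ^ splits(tree_b))


ref = (("c1", "c2"), ("c3", "c4"))
# The extra tuple around c1 is a single-child node added for illustration
rec = suppress_unifurcations((("c1",), ("c2", "c3", "c4")))
rf = rf_distance(ref, rec)
print(rec, rf)
```

For the example trees the polytomy contributes no non-trivial split, so the only difference is the reference split separating c1, c2 from c3, c4, giving a symmetric difference of 1. That would match the R-F value of 0.5 quoted elsewhere in this thread if TreeCmp reports half the symmetric difference, which is an assumption here.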
