Hi, I noticed that the provided ccle_seq data differs from the data available at the DepMap data portal. For example, many genes in the KRTAP family are named differently. There is only one "KRTAP10" gene in the organizer-provided table (I doubt KRTAP10 is a correct gene name). In the original file, by comparison, there are 12 corresponding genes named KRTAP10-xx, which according to Ensembl are distinct genes. Could the organizers confirm whether some preprocessing convention leads to this difference, or am I missing something? Thanks! BR, Wenyu

Created by Wenyu Wang WenyuWang
Dear @CTDsquaredPancancerChemosensitivityDREAMChallengeParticipants, As noted in [this thread](https://www.synapse.org/#!Synapse:syn21763589/discussion/threadId=7090), some of the HUGO gene identifiers were truncated in the original version of the CCLE RNASeq training data that was provided. This has been fixed. An updated version of this dataset is available at syn21822697.2. We recommend you re-download the data and use the latest version. Thanks for your understanding! Best, Robert On behalf of the CTD^2^ Pancancer Chemosensitivity DREAM Challenge Organizers
Interesting! Thanks for sharing Wenyu!
I just realized that the way I removed the Entrez IDs accidentally truncated some of the HUGO IDs, including the KRTAP genes. I have just fixed this problem, and I will work with Sage to update the file and send out an alert to participants. Thanks for pointing this out, Wenyu! -Eugene
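For anyone redoing this step themselves, here is a minimal sketch of stripping the Entrez ID without touching the HUGO symbol. It assumes the DepMap column labels look like `SYMBOL (EntrezID)`; the numeric IDs below are placeholders, not real Entrez IDs, and `strip_entrez_lossy` is a hypothetical illustration of how a truncation like this can happen, not the code the organizers actually used.

```python
import re

# Assumed DepMap-style column label: "<HUGO symbol> (<Entrez ID>)".
_ENTREZ_SUFFIX = re.compile(r"\s*\(\d+\)$")


def strip_entrez(label: str) -> str:
    """Drop only a trailing parenthesised numeric ID; keep the symbol intact."""
    return _ENTREZ_SUFFIX.sub("", label)


def strip_entrez_lossy(label: str) -> str:
    """Hypothetical over-eager variant: cuts at the first non-alphanumeric character."""
    return re.split(r"[^A-Za-z0-9]", label, maxsplit=1)[0]


# Placeholder IDs for illustration only.
labels = ["KRTAP10-1 (111)", "KRTAP10-2 (222)", "TSPAN6 (333)"]

clean = [strip_entrez(x) for x in labels]
lossy = [strip_entrez_lossy(x) for x in labels]

# A quick sanity check: stripping should never create duplicate feature names.
has_duplicates_clean = len(set(clean)) != len(clean)
has_duplicates_lossy = len(set(lossy)) != len(lossy)
```

Checking for duplicate feature names after the rename is a cheap way to catch this kind of collapse, since distinct KRTAP10-xx genes all end up as "KRTAP10" in the lossy version.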
Indeed! I also noticed the lack of RNA-seq data for A673 in the Q2 release. Here is the explanation from the DepMap organizers: "This cell line was removed because it failed some RNAseq QC metrics related to total number of unexpressed transcripts, but after some reevaluation we have decided to add it back for the time being. So soon we will upload a revision to the portal which will include this cell line."
Actually, the Q2 release also lacks expression values for one of the 515 cell lines from the competition, ACH-000052 (A673), so using the current DepMap data without noticing the omission will affect scoring! :)
I see, thanks for the clarification!
Hi Wenyu, The only pre-processing that was done was: (1) removal of the Entrez ID from feature names (as PANACEA data uses only HUGO IDs), and (2) filtering to the 515 cell lines that overlap with the Achilles shRNA data set. The major difference between the current DepMap matrix and the version we provided is the release: we used **20Q1 (2020 quarter 1 data set, about 300 MB)**, but the DepMap people have since updated that data set to **20Q2, which is around 370 MB.** We only provided this preprocessed data for convenience, to give you all an idea of the types of data you might use. You should feel free to use the current DepMap data if you prefer, as that doesn't affect scoring at all. Put simply, we provided an earlier version of the RNAseq data, and it looks like the data set has changed considerably within the last month (because DepMap does updates quarterly). -Eugene
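The cell-line filtering step described above could be sketched roughly as follows. This is only an illustration under assumptions: the expression matrix is modelled as a plain dict keyed by DepMap ID, the values are made up, and `filter_cell_lines` is a hypothetical helper name. Reporting the IDs that are requested but absent makes a gap between releases visible instead of silently dropping a cell line.

```python
def filter_cell_lines(expression, wanted_ids):
    """Return (subset, missing).

    subset  -- rows of `expression` whose ID is in `wanted_ids`
    missing -- wanted IDs that are absent from `expression`
    """
    subset = {cid: expression[cid] for cid in wanted_ids if cid in expression}
    missing = [cid for cid in wanted_ids if cid not in expression]
    return subset, missing


# Toy matrix with made-up values: DepMap ID -> {gene symbol: expression}.
expression = {
    "ACH-000001": {"TSPAN6": 1.2},
    "ACH-000002": {"TSPAN6": 0.4},
}

# Hypothetical list of challenge cell lines (the real list has 515 entries).
wanted = ["ACH-000001", "ACH-000052"]

subset, missing = filter_cell_lines(expression, wanted)
if missing:
    print("Cell lines not found in this release:", ", ".join(missing))
```

Running the same check against a newer DepMap release before training would show whether every challenge cell line is still present.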

Problematic gene names in provided ccle_seq data?