Hi!
When will the scoring scripts to compute the relevant metrics on the validation data be made available?
Thanks
Sanjit
There is no -nth option for the scoring script.
I couldn't use Python multiprocessing since the memory requirement was too large.
Please do the parallelization in your shell script instead (https://unix.stackexchange.com/a/216475).
I used the following trick to score all Round 1 submissions.
```bash
# Run at most N scoring jobs in parallel (adapted from the linked answer).
N=4
(
  for thing in a b c d e f g; do
    # Every N iterations, wait for the current batch of background jobs to finish.
    ((i=i%N)); ((i++==0)) && wait
    task "$thing" &   # replace `task` with your scoring command
  done
  wait   # wait for the final batch of jobs
)
```
Hi Seth,
@jseth It seems that I cannot find the -nth parameter in the scoring script. How can I enable multithreading for scoring?
Thanks,
Chenyang Hi Sanjit.
I don't know which specific bigwig files you're referring to, but generally bigwigs won't have a line covering 0 values at the end of a chromosome. I don't know why this is the standard format, but it's something I've fairly consistently seen.
As for the scoring script, I'll bring it up with the others and get back to you.
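For anyone who wants to verify this on their own tracks, here is a small pyBigWig sketch (the file name is just a placeholder for one of the challenge bigwigs, e.g. C51M22) that reports where each chromosome's data actually ends relative to its declared length:

```python
import pyBigWig

# Placeholder path; point this at one of the challenge bigwig tracks.
bw = pyBigWig.open("C51M22.bigwig")
for chrom, length in bw.chroms().items():
    intervals = bw.intervals(chrom)
    if not intervals:
        continue
    last_end = intervals[-1][1]   # end coordinate of the final (start, end, value) interval
    if last_end < length:
        # Trailing zero coverage is simply omitted rather than written as an explicit 0-value entry.
        print(f"{chrom}: data ends at {last_end}, chromosome length is {length}")
bw.close()
```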
This is a gentle reminder for the above two queries. One more question: GWSpear and MSEvar are on the list of metrics on the Wiki but are not in the scoring script. Also, the scoring script has match1, catch1obs, catch1imp, aucobs1, aucimp1, which are not mentioned on the Wiki. Is there a reason for this change? Thank you! This was an error on my part.
One additional question I had was: the last position on each chromosome, for instance on chr1 in C51M22, is 248933981 instead of 248956422 (±1). I have noticed in multiple tracks on multiple chromosomes that the last position covered is not the same as the length of the chromosome from hg38. Is there a reason for this? Some of our methods require the chromosome lengths in the input bigwig files to be consistent, and I was wondering if we could simply add a line with a 0 value to make sure this is the case.

(base) [ec2-user@ip-172-30-0-12 tmp2]$ python ~/code/on_master/imputation_challenge/score.py /mnt/imputation-challenge/data/evaluation_data/avocado/C03M02.p8.bigwig /mnt/imputation-challenge/data/validation_data/C03M02.bigwig --chrom chr22 --gene-annotations ~/code/on_master/imputation_challenge/annot/hg38/gencode.v29.genes.gtf.bed.gz --enh-annotations ~/code/on_master/imputation_challenge/annot/hg38/F5.hg38.enhancers.bed.gz
[2019-05-14 00:25:20,991 INFO] ['/home/ec2-user/code/on_master/imputation_challenge/score.py', '/mnt/imputation-challenge/data/evaluation_data/avocado/C03M02.p8.bigwig', '/mnt/imputation-challenge/data/validation_data/C03M02.bigwig', '--chrom', 'chr22', '--gene-annotations', '/home/ec2-user/code/on_master/imputation_challenge/annot/hg38/gencode.v29.genes.gtf.bed.gz', '--enh-annotations', '/home/ec2-user/code/on_master/imputation_challenge/annot/hg38/F5.hg38.enhancers.bed.gz']
[2019-05-14 00:25:20,991 INFO] Opening bigwig files...
[2019-05-14 00:25:20,992 INFO] Reading from enh_annotations...
[2019-05-14 00:25:21,071 INFO] Reading from gene_annotations...
[2019-05-14 00:25:21,127 INFO] Scoring for chrom chr22...
[2019-05-14 00:26:59,612 INFO] y_true_len: 2032739
[2019-05-14 00:28:42,937 INFO] y_predicted_len: 2032739
chr22 13.61572903980927 1308.894128158828 1087.0741364812238 0.8094923476033987 11361 15349 17951 0.9169686688249004 0.9661489947344569 117.23225772469586 19.773003904918255 186.69533827574193
[2019-05-14 00:28:45,678 INFO] All done.
I got 13.62 with the script on the master branch. To give credit where it's due, it was Jin who ran the scoring scripts. I only looked at the results. Hi!
Jacob and I are getting inconsistent MSE values for some of the datasets. For instance, for the C03M02.p8 dataset, Jacob obtained an MSE of 13.62 on chr22, while I obtained an MSE of 859.06, using the official scoring script. I was wondering why this might be happening.
Thanks
Sanjit 1. The `y_all` aspect of mseVar should be calculated over all training cell lines for that assay. You're essentially saying that you want to weight the errors more heavily when they come from positions that do vary across cell types, rather than those that don't.
2. I must've missed that function in the scoring script. You're right that it'll just be `scipy.stats.spearmanr(y_true, y_pred)` though.
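For concreteness, here is a rough NumPy/SciPy sketch of the two metrics as described above. This is only an illustration (the function names and the exact weighting scheme are assumptions), not the official implementation in score.py:

```python
import numpy as np
from scipy.stats import spearmanr

def mse_var(y_true, y_pred, y_all):
    # y_true, y_pred: arrays of shape (n_positions,) for the track being scored.
    # y_all: array of shape (n_positions, n_training_cell_types) for the same assay.
    # Squared errors are weighted by how much the signal varies across training cell
    # types, so positions that differ between cell types count more heavily.
    weights = np.var(y_all, axis=1)
    return np.average((y_true - y_pred) ** 2, weights=weights)

def gwspear(y_true, y_pred):
    # Genome-wide Spearman correlation, as noted in point 2 above.
    return spearmanr(y_true, y_pred)[0]
```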
Hi,
I would like to ask something related to score.py.
Compared with 3.5 - Prediction Scoring Metrics and Ranking, it lacks mseVar and gwspear. Will you add these two functions?
1. For mseVar, it already exists, but it needs y_all for calculating the weights across all cell types. How should y_all be built? How do we combine all assay types for one cell type? Or is it just like the original form, where each cell_type and assay_type pair gives a vector, i.e. y_all: numpy.ndarray, shape=(n_positions, n_samples), where n_samples is the number of samples we have in the validation data sets?
2. For gwspear, I think we can just call the existing function in Python... Both the bigwig and the .bedgraph.gz files are being uploaded now. Let me know if you run into any issues, and sorry for the oversight on the previous .bedgraph.gz files. Shortly after uploading them originally, I had found and fixed the issue internally, but forgot to push the new files for y'all. Bigwig files will be great so that we don't have to do the conversion from bedgraph. Thank you Jacob! Oh, it looks like my upload of the bigwigs didn't go through. I will get correct versions up by tomorrow.
Jacob Yes, the bedgraph files appear to be incorrect. Please use the bigwig files instead. I'll see if I can delete the bedgraph ones.
Jacob Hi, sorry for asking a question on the weekend.
The format of the baseline files and the validation bedgraph files seems strange.
The chromStart values are 1, 2, 3, 4, ...
The chromEnd values are 25, 50, 75, 100, ...
If I use bedGraphToBigWig, the error is "Error - overlapping regions in bedGraph line 2 of C29.M29.bedGraph", which seems obvious given those coordinates.
Is there something wrong with the files? Or what kind of tool can I use to convert bedgraph to bigwig?
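(For reference, the overlap that bedGraphToBigWig rejects can be confirmed with a short check like the sketch below; the file name is taken from the error message above, and this is only an illustration, not something the challenge requires.)

```python
# bedGraph intervals within a chromosome must not overlap: each start must be
# greater than or equal to the previous interval's end. Starts of 1, 2, 3, ...
# with ends of 25, 50, 75, ... violate this, which is what bedGraphToBigWig reports.
prev_chrom, prev_end = None, 0
with open("C29.M29.bedGraph") as f:   # file name from the error message above
    for lineno, line in enumerate(f, 1):
        if line.startswith(("track", "#")):
            continue
        chrom, start, end, _value = line.split()
        start, end = int(start), int(end)
        if chrom == prev_chrom and start < prev_end:
            print(f"line {lineno}: {chrom}:{start}-{end} overlaps the previous interval")
            break
        prev_chrom, prev_end = chrom, end
```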
Thanks!
Ying @sanjitsbatra @cying @jseth : I just pushed those annotation files to the repo. @sanjitsbatra and @cying thank you for pointing that out! We will add those files to the repo ASAP and post back here when they are available.
Seth
Hi!
May I ask where the gene-annotations and enh-annotations files are?
best wishes,
Ying I see! It seems that the two files required for hg38 are also not present in the annot/hg38 folder. Would it be possible to add those to the repo? @sanjitsbatra, no, everything for this challenge is GRCh38. The scoring script repo has an example for GRCh38 and one for hg19. @leepc12 I think we should remove the hg19 example to avoid confusion.
The -nth parameter is used to specify the number of threads that the scoring code uses. The calculated score will be the same for any number of threads. We will probably use multiple threads when we run the official scoring, but that won't change the results.
I hope this is helpful.
Seth
Hi!
On the scoring script GitHub repo, since we are using hg19 for this challenge (right?), the suggested command has the following two arguments:
--gene-annotations annot/hg19/gencode.v19.annotation.protein_coding.full.sorted.genes.bed.gz \
--enh-annotations annot/hg19/human_permissive_enhancers_phase_1_and_2.bed.gz
However, the repo doesn't contain the annot/hg19 folder. Is there somewhere else we can download these two files from?
Further, what is the purpose of these two flags: --prom-loc 80 --nth 1? Will these be the values used for these flags during the official scoring as well?
Thanks
Sanjit Yes, those files should be good to go. We haven't scored them internally yet, but we're getting ready to do that soon. Please let us know if you have any questions or concerns about them. Hi Jacob,
I saw some files were added to the baseline folder.
Are they ready to use as the average prediction?
Best,
Zhijian Thank you Jacob! That would be great. Hi Sanjit
I'm working on releasing the average activity baseline soon. This is the error that you get if you use the average activity at each position across all training cell types for a particular assay as the predictor for the test set. I'm also working on running the Avocado baseline. I'm not sure whether we're going to release that in the first round, or keep it internally and release it at some point in the future. We're still talking about that part!
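For clarity, the average activity baseline amounts to something like the sketch below (made-up variable names, assuming all training tracks for an assay are binned identically; this is not the exact code used to produce the released files):

```python
import numpy as np

def average_activity(training_tracks):
    # training_tracks: array of shape (n_training_cell_types, n_positions),
    # the binned signal for one assay across all training cell types.
    # The prediction for any test cell type is simply the per-position mean.
    return training_tracks.mean(axis=0)

# Hypothetical usage: stack the training tracks for one assay and predict.
# prediction = average_activity(np.stack([track_c01, track_c02, track_c03]))
```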
Jacob Hi!
I am not sure if this is in line with the challenge guidelines, but I wanted to ask whether anyone is working on running baselines on the dataset. For instance, running Avocado is computationally expensive, but it is perhaps the most natural thing to do and also a great baseline to compare against. Would it be possible for people to share the results of such baselines here?
Again, apologies if this is in violation of the challenge rules.
Best
Sanjit Hi Sanjit and All - The scoring code and documentation are available here: https://github.com/ENCODE-DCC/imputation_challenge
We plan to implement at least one additional scoring metric. When it is written and tested, it will be merged into the master branch of the imputation_challenge repo. Watch the repo on GitHub to get automatic notifications about changes or additions.
Good luck!
Seth