i.e. if I write:
>meta ='/metadata/exams_metadata.tsv'
>META=open(meta,'r')
>line=line.rstrip()
>table=line.split("\t")
>age[pid]=float(table[7])
what I retrieve from table[7] is age, right?
Just to make sure: I found that if I shift by one column, all predictions become zero, so I would like to confirm the structure of the leaderboard/final test set.
Also, are there NAN/nan/NA values in this table that I need to take care of?
Created by Yuanfang Guan (yuanfang.guan)
i already submitted...
i will do the modification in the next round
> isn't "*" also NA?
@davecg The dot represents missing values. The asterisk ('*') should only be found in the metadata file from the scoring set in SC2 (obfuscated values). In that aspect, you are correct to specify it to the parser.
@yuanfang.guan: This is a good time to use the approach described by David and myself. Before the Validation Phase, we will release a checklist to help you submit a container that complies with a few guidelines (e.g. place all the development files in the directory `/src`). One of them will be to refer to columns by their name. This will allow you to stop worrying about column indexes, both right now and in all your future experiments. Remember also that your container will be used during the Collaborative Phase before being released open source to the community. Users will not have to worry about column indexes if we can state that the containers refer to data using the column name. If Perl does not provide a library similar to pandas in Python, the best approach is to read the first line of the file (header) into a list and get the index of "age", for example. I know this is not the answer that you asked for but I hope that I convinced you that the above approach is the way to go.
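For reference, here is a minimal sketch of the header-lookup approach described above, using only the Python standard library. The helper names `column_index` and `read_column` and the three-column sample are hypothetical and only for illustration; the real metadata file has more columns.

```python
import io


def column_index(header_line, name, sep="\t"):
    """Return the zero-based position of `name` in a delimited header line."""
    columns = header_line.rstrip("\r\n").split(sep)
    return columns.index(name)  # raises ValueError if the column is absent


def read_column(handle, name, sep="\t"):
    """Yield the raw string value of column `name` for every data row."""
    idx = column_index(next(handle), name, sep)
    for line in handle:
        line = line.rstrip("\r\n")
        if line:  # skip blank lines
            yield line.split(sep)[idx]


# Hypothetical sample standing in for the metadata file.
sample = io.StringIO("subjectId\texamIndex\tage\n1\t1\t52\n2\t1\t.\n")
ages = list(read_column(sample, "age"))
```

Because the index is looked up from the header on every run, the same code keeps working if the column order differs between the training and scoring files. The values come back as raw strings, so missing-value markers still need to be handled before calling `float`.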
Thanks! thanks both for help.
@tschaffter
ok. now i read it.
question : no matter what i won't have time to deal with this this cycle. I have a full day of teaching and meetings on monday. and on weekends i have to keep both my eyes on my four kids. so it will have to be next round.
**can you please just confirm that meta ='/metadata/exams_metadata.tsv' is in the exact same format as the training set?**
You can specify dtype on import, and if there is more than one NA value (isn't "*" also NA?), you can use a list for that.
This code is the same as Thomas' except I added the dtype argument.
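```python
import pandas as pd

# `fields` and `examsMetadataFilename` come from Thomas' example.
dtype = {
    'subjectId': object,  # pandas uses object for str for some reason
    'examIndex': int,
    'cancerL': int,  # note: an int column cannot hold NA, so parsing fails if '.' or '*' appears here
    'cancerR': int,
}
metadata = pd.read_csv(examsMetadataFilename, sep="\t", na_values=['.', '*'],
                       usecols=fields, dtype=dtype)
```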
Hi Yuanfang,
Here is the example in python that I promised to share with you. In this example, I read only the columns that are going to be used. Moreover, declaring the symbol that represents missing values (a dot '.') lets columns that mix numeric and missing values be correctly inferred as numeric columns; otherwise they would be read as strings if they include at least one dot.
https://github.com/tschaffter/dm-docker/blob/master/dm-preprocess-png/generate_image_labels.py
yes i will definitely learn that in the future.
but, for this challenge, there is no time to learn it now. so i just want to know if there is column shift and different weird characters in the age and exam sequence column.
Are you using python?
You can use either DictReader (from standard library CSV, just set delimiter to "\t") to get back rows as dictionaries or read_table from Pandas to get whole document as a DataFrame.
Both let you write code that's agnostic to the actual column order.
Pandas and sklearn have good ways of dealing with null values too.
I'm processing as if there might be nulls in any demographic column.
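A minimal sketch of the `DictReader` route, assuming the column names `subjectId` and `age` and treating '.', '*' and the empty string as missing. The helper names and the sample rows are hypothetical, not taken from the real file.

```python
import csv
import io

# Assumption: markers treated as missing ('.' = missing, '*' = obfuscated).
MISSING = {".", "*", ""}


def to_float(value):
    """Convert a raw field to float, or None when it is a missing marker."""
    if value is None or value.strip() in MISSING:
        return None
    return float(value)


def read_ages(handle):
    """Map subjectId -> age, looking columns up by name rather than position."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["subjectId"]: to_float(row["age"]) for row in reader}


# Hypothetical sample standing in for the metadata file.
sample = io.StringIO(
    "subjectId\texamIndex\tage\n1\t1\t52\n2\t1\t.\n3\t1\t*\n"
)
ages = read_ages(sample)
```

If a subject has several exams, this keeps only the last row for that subject; keying on both `subjectId` and `examIndex` would avoid that.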
Just to confirm, the age column is DEFINITELY at the 8th column right?