Hello,

Regarding my last submission, 8457981: I can run the docker locally on my machine without any issues. However, when I submit it to the inference sc1 queue, it crashes and I get the log below. This is very strange, because I have been able to submit other dockers without problems.

STDOUT: [binary, non-printable output omitted]
STDERR: /sc1_infer.sh: line 21: 8 Segmentation fault (core dumped) python /root/bin/evaluation_merge.py
STDOUT: [binary, non-printable output omitted, ending in "x86_64"]

Created by Alberto Albiol alalbiol
> Please cancel submissions 8464317 and 8464238, so we can submit the fixed one

Done. You should receive error messages saying that these two submissions failed, but they stopped because we manually intervened.
Dear moderators, we think that we have finally found a possible bug that explains this strange behaviour (basically, the bug corrupted memory and the program behaved randomly). Please cancel submissions 8464317 and 8464238 so that we can submit the fixed one. Apologies.
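For anyone chasing a similar crash, here is a minimal sketch of one way to get more information out of a segfaulting Python job than the bare "Segmentation fault (core dumped)" line. It assumes Python 3 on a POSIX system, where `faulthandler` is part of the standard library (on Python 2 it is, as far as I know, a separate package). The helper `run_and_report` and the `CRASHING_CHILD` command are hypothetical names for illustration only; the child process is a stand-in that crashes on purpose, not the actual `evaluation_merge.py`.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (status description, captured stderr).

    Hypothetical helper for illustration, not part of any challenge tooling.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr = proc.stderr.decode("utf-8", errors="replace")
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        name = signal.Signals(-proc.returncode).name
        return "killed by " + name, stderr
    return "exited with code %d" % proc.returncode, stderr


# Stand-in for the real job: a child that reads from address 0 and crashes.
# "-X faulthandler" asks CPython to print a Python traceback when that happens.
CRASHING_CHILD = [
    sys.executable, "-X", "faulthandler",
    "-c", "import ctypes; ctypes.string_at(0)",
]

if __name__ == "__main__":
    status, stderr = run_and_report(CRASHING_CHILD)
    print(status)
    print(stderr)
```

The traceback only shows where the Python side was when the crash happened. If the memory was corrupted earlier by native code, the reported location can be far from the real bug, so this is a starting point rather than a diagnosis.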
I'm from Alberto's team, '42 is the answer'. I've just submitted a job produced last Sunday that ended in a core dump. The job (submitted on Sunday) ran for 3 days and then core dumped. We have tried to reproduce this situation, and the result is shown above. Now we have tried to resubmit the job that had stayed alive for several days, and the result is:

```
STDOUT: READING PATIENT DATABASE: :[20000] 20000
STDOUT: READING PATIENT DATABASE: :[30000] 30000
STDOUT: READING PATIENT DATABASE: :[40000] 40000
STDOUT: [***] [ READING PATIENT DATABASE: ] Processed Elements: 42817
STDOUT: [mamo_contest::application::application()] /root/modelState/porpoise_filename_data_3.tsv not found.
STDOUT: Exam list not present using only: "/metadata/images_crosswalk.tsv"
STDOUT: [void patient_list::parse_files(const path&)] "[binary, non-printable output omitted]
STDERR: /sc1_infer.sh: line 21: 8 Segmentation fault (core dumped) python /root/bin/evaluation_merge.py
STDOUT: [binary, non-printable output omitted, ending in "x86_64"]
```

But the reason it stopped last time was just:

```
STDOUT: Data Shape = (4, 339)
STDOUT: X shape = (4, 339)
STDOUT: getting studio scores
STDOUT: 9957 R 0.105210187559
STDOUT: 9957 L 0.0125300827631
STDOUT: Batch Processed: elapsed since start: 164035.803323
STDOUT: 9965
STDOUT: New batch elapsed since start: 164046.099556
STDOUT: The score label new_score is not available
STDOUT: Data Shape = (4, 339)
STDOUT: X shape = (4, 339)
STDOUT: getting studio scores
STDOUT: 9965 R 0.0756200886285
STDOUT: 9965 L 0.0292592714558
STDOUT: Batch Processed: elapsed since start: 164046.358021
STDOUT: 9969
STDOUT: New batch elapsed since start: 164057.629836
STDOUT: The score label new_score is not available
STDOUT: Data Shape = (5, 339)
STDOUT: X shape = (5, 339)
STDOUT: getting studio scores
STDOUT: 9969 R 0.107686007631
STDOUT: 9969 L 0.0256583138454
STDERR: /sc1_infer.sh: line 21: 7 Segmentation fault (core dumped) python /root/bin/evaluation_merge.py
STDOUT: Batch
```

This is quite different: in the job produced on Sunday, the segmentation fault came later, after the inference files had been read from the **/metadata** directory. Now all jobs stop just when trying to read the **/metadata** directory, the same as the job we prepared only to find out what the issue was.

So, about your question:

```
Do you obtain systematically the same output when resubmitting the same job?
```

**Yes**, and we systematically get the same output even with jobs that had previously passed the first steps. And I can now give more information: **No**, we still don't know whether the problem we had before these submissions is our fault or just something that will be fixed.
Alberto, do you obtain systematically the same output when resubmitting the same job?
It looks like you have some kind of serious error in your code, causing a core dump. Here's a bit more of your log file, which might help you "sleuth" the problem:

```
Loading init weights from : model_init_weights.hdf5
Simulation starts at 2017-03-15 22:00:28.037119
****************
Init test generator reto 1
READING PATIENT DATABASE: :[1] 1
READING PATIENT DATABASE: :[2] 2
READING PATIENT DATABASE: :[3] 3
READING PATIENT DATABASE: :[4] 4
READING PATIENT DATABASE: :[5] 5
READING PATIENT DATABASE: :[6] 6
READING PATIENT DATABASE: :[7] 7
READING PATIENT DATABASE: :[8] 8
READING PATIENT DATABASE: :[9] 9
READING PATIENT DATABASE: :[10] 10
READING PATIENT DATABASE: :[20] 20
READING PATIENT DATABASE: :[30] 30
READING PATIENT DATABASE: :[40] 40
READING PATIENT DATABASE: :[50] 50
READING PATIENT DATABASE: :[60] 60
READING PATIENT DATABASE: :[70] 70
READING PATIENT DATABASE: :[80] 80
READING PATIENT DATABASE: :[90] 90
READING PATIENT DATABASE: :[100] 100
READING PATIENT DATABASE: :[200] 200
READING PATIENT DATABASE: :[300] 300
READING PATIENT DATABASE: :[400] 400
READING PATIENT DATABASE: :[500] 500
READING PATIENT DATABASE: :[600] 600
READING PATIENT DATABASE: :[700] 700
READING PATIENT DATABASE: :[800] 800
READING PATIENT DATABASE: :[900] 900
READING PATIENT DATABASE: :[1000] 1000
READING PATIENT DATABASE: :[2000] 2000
READING PATIENT DATABASE: :[3000] 3000
READING PATIENT DATABASE: :[4000] 4000
READING PATIENT DATABASE: :[5000] 5000
READING PATIENT DATABASE: :[6000] 6000
READING PATIENT DATABASE: :[7000] 7000
READING PATIENT DATABASE: :[8000] 8000
READING PATIENT DATABASE: :[9000] 9000
READING PATIENT DATABASE: :[10000] 10000
READING PATIENT DATABASE: :[20000] 20000
READING PATIENT DATABASE: :[30000] 30000
READING PATIENT DATABASE: :[40000] 40000
[***] [ READING PATIENT DATABASE: ] Processed Elements: 42817
[mamo_contest::application::application()] /root/modelState/porpoise_filename_data_3.tsv not found.
Exam list not present using only: "/metadata/images_crosswalk.tsv"
[void patient_list::parse_files(const path&)] "[binary, non-printable output omitted]
/sc1_infer.sh: line 21: 8 Segmentation fault (core dumped) python /root/bin/evaluation_merge.py
[binary, non-printable output omitted, ending in "x86_64"]
```
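Logs like the one above are hard to read because most of the bytes are not text. As a rough sketch (assuming Python 3, and assuming you have the raw log as bytes), the readable lines can be separated from the binary noise by keeping only lines that are almost entirely printable ASCII. The names `printable_ratio` and `readable_lines`, and the 0.9 threshold, are hypothetical choices for illustration.

```python
import string

# Printable ASCII, minus the vertical-tab and form-feed control characters.
PRINTABLE = set(string.printable) - set("\x0b\x0c")


def printable_ratio(line):
    """Fraction of characters in line that are printable ASCII."""
    if not line:
        return 1.0
    return sum(ch in PRINTABLE for ch in line) / len(line)


def readable_lines(raw, threshold=0.9):
    """Return the lines of a raw byte log that look like ordinary text.

    Bytes outside ASCII are decoded as U+FFFD, which counts as non-printable,
    so lines dominated by binary data fall below the threshold and are dropped.
    """
    text = raw.decode("ascii", errors="replace")
    return [
        line for line in text.split("\n")
        if line.strip() and printable_ratio(line) >= threshold
    ]


if __name__ == "__main__":
    # Small made-up sample: two text lines around one line of binary noise.
    sample = (
        b"Exam list not present\n"
        b"\x00\x9f\xf0W\x00\x01\x02 P \xff\xfe\n"
        b"/sc1_infer.sh: line 21: 8 Segmentation fault (core dumped)\n"
    )
    for line in readable_lines(sample):
        print(line)
```

A ratio test like this is deliberately crude: a binary line that happens to contain mostly printable bytes would slip through, so the threshold may need adjusting for a given log.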

corrupted docker?