I'm using the command line to download a large, publicly available dataset and am making progress, but I have to re-initiate the download constantly to keep it going. I get a few files at a time before it crashes with various errors, including:

```
libgcc_s.so.1 must be installed for pthread_cancel to work.  # most common
(Caused by NewConnectionError(': Failed to establish a new connection: [Errno 16] Device or resource busy'))
caused by OSError: [Errno 24] Too many open files: '/home/darren/.synapseCache/652/40062652/.cacheMap.lock'
```

Rerunning the `get` call resumes some of the file downloads, presumably via the `.synapseCache`, but it always ends in another error. Is there anything I can do about this? Here's the command I'm running:

`synapse get -r syn17865732`

Once a failure occurs I rerun the same command, and the size of the directory containing the downloaded data appears to (generally) keep increasing, although I have seen it drop occasionally (possibly from removing partial downloads?). Ideally, something more like `rsync`, which can run in the background and automatically resume after errors, would be much preferred.

Thanks,
Darren
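Edit: as a stopgap, something like the following retry loop could automate the reruns (a rough sketch; it assumes the `synapse` CLI exits with a nonzero status when the download fails, and relies on the `.synapseCache` to resume files already downloaded):

```python
import subprocess
import time

MAX_ATTEMPTS = 50

for attempt in range(1, MAX_ATTEMPTS + 1):
    # Rerun the recursive download; previously completed files should be
    # picked up from the cache rather than re-downloaded.
    result = subprocess.run(["synapse", "get", "-r", "syn17865732"])
    if result.returncode == 0:
        print("Download completed")
        break
    print(f"Attempt {attempt} failed (exit {result.returncode}); retrying...")
    time.sleep(30)  # brief pause before resuming
else:
    print("Gave up after repeated failures")
```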

Created by Darren Tyson (@darrentyson)
Hi @darrentyson,

Thanks for the feedback. The reason for the difference in behavior between the command line version and the programmatic invocation of syncFromSynapse is that the current working directory serves as a decent default for a download-related command (e.g. `wget`), but when invoked [programmatically](https://python-docs.synapse.org/build/html/synapseutils.html#synapseutils.sync.syncFromSynapse) there isn't necessarily an equivalent default (the current working directory of the Python process invoking the function is not necessarily a good destination). Without a location specified, the function instead ensures that the files are downloaded to the cache, which is similar to the client's behavior when downloading a single file. I take your point that this behavior may not be what the user intended, and we'll evaluate how to make it clearer.

I'm going to guess that this was run on a shared system with a large number of CPUs? I think the "OSError: [Errno 24] Too many open files" may have to do with a per-process ulimit imposed by the system. Generally synapseclient will use any additional concurrency available to download multiple files faster, but it may be that the default concurrency, selected based on the number of processors on the system, is conflicting with a system limit on the number of file handles a process can have open at the same time. However, I'm not quite sure why this would differ between the command line and Python invocations (unless perhaps the Python function was invoked in a different shell or by a different user with different limits). If you have a chance, it might be revealing to run the following commands to see what the limits are:

```
# assuming this is a Linux variant:
# see ulimits on file descriptors and the number of processors
ulimit -Hn
ulimit -Sn
cat /proc/cpuinfo | grep processor | wc -l
```

It may be that we should try to determine these limits from synapseclient and reduce concurrency accordingly; I have opened an issue for us to address that.

Thanks for the error log as well. Going through it, the stack does make it appear that there are some DNS issues ("Name or service not known"), although this is perhaps a side effect of other errors encountered.

Jordan
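P.S. If the soft limit does turn out to be the bottleneck, one interim workaround is to raise the process's soft file-descriptor limit to the hard limit before starting the sync. A minimal sketch, assuming a Linux/macOS system (`resource` is in the Python standard library):

```python
import os
import resource

# Compare the CPU count that drives the client's default concurrency
# with the per-process open-file limits.
print("processors:", os.cpu_count())
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit so concurrent downloads are
# less likely to exhaust file descriptors.
if hard == resource.RLIM_INFINITY or soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```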
OK, I've been able to download the data of interest using the Python code rather than the command line functions. However, on the website you (Synapse) give the following code recommendations for downloading:

**Using Command line:**
[Image: command line code snippet (syn25455592)]

**Using Python:**
[Image: Python code snippet (syn25455590)]

However, the default actions of these snippets differ. The command line assumes you want to download all of the files to the current directory, whereas the Python code loads the files into the cache but does not save any of the associated metadata except for the last object downloaded. After downloading about 450 GB of data into the cache, I realized there was no way to re-establish the full file and directory structure from the cached files, and I was forced to download all the data again by changing the `sync` call from:

`files = synapseutils.syncFromSynapse(syn, 'syn17773758')`

to

`files = synapseutils.syncFromSynapse(syn, 'syn17773758', path="/data/Rashid_et_al_subset/DATASET-1")`

It is nice having the code snippets available like that, but I would strongly suggest you add something like:

`files = synapseutils.syncFromSynapse(syn, 'syn17773758', path='path_where_data_will_be_saved')`

so that the behavior would be the same with either method. I still do not know why the command line code could not complete the download.

Darren
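P.S. For anyone who finds this later, here is the complete pattern that worked for me (a minimal sketch; `login()` assumes credentials are already cached or stored in a Synapse config file):

```python
import synapseclient
import synapseutils

syn = synapseclient.login()  # assumes cached credentials or a .synapseConfig

# Download to an explicit local path so the full file and directory
# structure is recreated on disk, not just stored in the cache.
files = synapseutils.syncFromSynapse(
    syn, 'syn17773758', path='/data/Rashid_et_al_subset/DATASET-1'
)
```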
Thanks. I uploaded the error log to Synapse (syn25453909). Let me know if you have any problems accessing it.

Cheers,
Darren
The "KeyError" message suggests a software bug. Can you rerun with the " --debug " option: ``` synapse --debug get ... ``` and share the output?
Now the download is consistently failing at the same spot. I am getting the following error:

```
ValueError: File download failed during sync caused by KeyError: 'syn19109220'
```

I am able to create a zip file and download the entire syn19109220 subdirectory from the website, so it doesn't appear to be a problem with the files per se. Is there any way to skip/ignore/exclude this subdirectory when calling the `get` function? There doesn't appear to be, based on the documentation. Is there another workaround? I am unsure which files still need to be downloaded. It appears that the `sync` command would be very useful if it could be used to download as well as upload. Any help would be appreciated!
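One workaround I'm considering in the meantime is walking the folder tree myself and syncing each child separately, skipping the problematic one. A rough sketch, assuming `syn.getChildren` yields each immediate child as a dict with `id` and `name` keys:

```python
import synapseclient
import synapseutils

syn = synapseclient.login()

SKIP = {'syn19109220'}  # the subdirectory that keeps failing

# Sync each immediate child of the parent folder on its own, so one bad
# subtree doesn't abort the whole download.
for child in syn.getChildren('syn17865732'):
    if child['id'] in SKIP:
        print(f"skipping {child['id']} ({child['name']})")
        continue
    synapseutils.syncFromSynapse(syn, child['id'], path=f"./data/{child['name']}")
```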
I think I have at least fixed the `libgcc_s.so.1 must be installed for pthread_cancel to work` error by adding the `libgcc` package to my conda environment with `conda install libgcc`, but other errors are still occurring. Most recent error:

```
ValueError: File download failed during sync caused by OSError: [Errno 24] Too many open files: '/home/darren/.synapseCache/836/40057836/.cacheMap.lock'
```
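In case it helps anyone else hitting the same Errno 24: one thing I may try next (assuming a recent synapseclient that exposes this setting) is lowering the number of concurrent transfer threads so fewer cache files are open at once:

```python
import synapseclient
import synapseutils

syn = synapseclient.login()

# Assumption: recent synapseclient versions expose a max_threads attribute
# controlling concurrent transfers; fewer threads means fewer files (and
# .cacheMap.lock handles) open at the same time.
syn.max_threads = 4

files = synapseutils.syncFromSynapse(syn, 'syn17865732')
```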

Consistent failures trying to download large dataset