We have a large (1.2 TB) dataset made up of many thousands of files. A researcher is trying to download it but is running into API limits. How can they best proceed to get the dataset?

Created by Hendrik Fink (@hfink)
Apologies for the delay in response @Chandra.Suda. Are you still having trouble downloading the dataset?
@Chandra.Suda Using the "download list" seems like a good idea. Is the timeout due to "waiting for query results" a consistent problem or just a transient one? That is, if you try again do you get the same result? Also tagging @thomas.yu , in case he has other ideas for using the Synapse Python client to download large numbers of files.
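For reference, one way to drive a recursive download from Python rather than the CLI is `synapseutils.syncFromSynapse`. A minimal sketch is below; the folder ID is the one mentioned in this thread, the local path is just a placeholder, and this is not guaranteed to avoid the rate limit, only to give the Python client a starting point:

```python
import synapseclient
import synapseutils

# assumes cached credentials or a personal access token configured locally
syn = synapseclient.login()

# recursively download everything under the folder/project into ./coda_tb
# (local path is illustrative only)
files = synapseutils.syncFromSynapse(syn, "syn39711400", path="./coda_tb")
print(f"Downloaded {len(files)} files")
```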
Thank you @brucehoff. I have emailed them and am still waiting for a response. Alternatively, I tried adding some files to the download cart and using Python to download them. I ran this command:

```
dl_list_file_entities = syn.get_download_list("L1_8")
```

but received the error below. Can you please help me download the dataset? Any help would be greatly appreciated, even if it takes a while.

```
---------------------------------------------------------------------------
SynapseTimeoutError                       Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18504\3427162359.py in <module>
      2 syn = synapseclient.login('chandra.suda',')
      3
----> 4 dl_list_file_entities = syn.get_download_list("L1_8")

~\Anaconda3\lib\site-packages\synapseclient\client.py in get_download_list(self, downloadLocation)
   1445         :returns: manifest file with file paths
   1446         """
-> 1447         dl_list_path = self.get_download_list_manifest()
   1448         downloaded_files = []
   1449         new_manifest_path = f'manifest_{time.time_ns()}.csv'

~\Anaconda3\lib\site-packages\synapseclient\client.py in get_download_list_manifest(self)
   1422         :returns: path of download list manifest file
   1423         """
-> 1424         manifest = self._generate_manifest_from_download_list()
   1425         # Get file handle download link
   1426         file_result = self._getFileHandleDownload(

~\Anaconda3\lib\site-packages\synapseclient\client.py in _generate_manifest_from_download_list(self, quoteCharacter, escapeCharacter, lineEnd, separator, header)
   1415             }
   1416         }
-> 1417         return self._waitForAsync(uri="/download/list/manifest/async", request=request_body)
   1418
   1419     def get_download_list_manifest(self):

~\Anaconda3\lib\site-packages\synapseclient\client.py in _waitForAsync(self, uri, request, endpoint)
   3262                 break
   3263             else:
-> 3264                 raise SynapseTimeoutError('Timeout waiting for query results: %0.1f seconds ' % (time.time()-start_time))
   3265         if result.get('jobState', None) == 'FAILED':
   3266             raise SynapseError(

SynapseTimeoutError: Timeout waiting for query results: 601.1 seconds
```
> I tried looking into how to bundle the files, but wasn't able to find how. Can you please let me know how I can bundle the files together? Any resources or guide would be greatly appreciated.

You would speak to the person who uploaded the 750K files and ask them to create compressed archives of the small files, so that there are a small number of large files. That person would then share the archives with you through Synapse.
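As a rough illustration of the kind of bundling meant here, the uploader could batch the raw recordings into roughly 100 MB gzipped tar archives before sharing them. A minimal sketch using Python's standard `tarfile` module follows; the source folder name, archive names, and target size are only illustrative, not taken from this thread:

```python
import os
import tarfile

SOURCE_DIR = "longitudinal_data"    # illustrative: local folder of many small files
TARGET_SIZE = 100 * 1024 * 1024     # aim for roughly 100 MB per archive

def flush(paths, idx):
    """Write one batch of files into a single gzipped tar archive."""
    with tarfile.open(f"bundle_{idx:04d}.tar.gz", "w:gz") as tar:
        for p in paths:
            tar.add(p)

batch, batch_bytes, archive_idx = [], 0, 0
for root, _dirs, files in os.walk(SOURCE_DIR):
    for name in files:
        path = os.path.join(root, name)
        batch.append(path)
        batch_bytes += os.path.getsize(path)
        if batch_bytes >= TARGET_SIZE:
            flush(batch, archive_idx)
            batch, batch_bytes, archive_idx = [], 0, archive_idx + 1

if batch:  # don't drop the final partial batch
    flush(batch, archive_idx)
```

The resulting archives could then be shared through Synapse in place of the individual recordings, so a downloader needs a few hundred large transfers instead of 750K small ones.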
@brucehoff I just got this error after the download had been paused for 30 minutes:

```
ValueError: File download failed during sync caused by FileNotFoundError: [Errno 2] No such file or directory: 'c:\\users\\sudac\\testtb\\longitudinal_data\\longitudinal_2\\longitudinal_01\\longitudinal_02\\longitudinal_03\\longitudinal_04\\longitudinal_05\\longitudinal_06\\longitudinal_07\\longitudinal_08\\longitudinal_09\\longitudinal_10\\1646895060674-recording-1.wav.synapse_download_101929592'
```

Not sure if it has to do with the CODA TB Dream Challenge dataset. I have restarted the download (I hope it doesn't duplicate files). This is what I am receiving now:

```
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Retrying in 16 seconds
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Retrying in 16 seconds
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
```
Thank you @brucehoff for your quick response. I tried looking into how to bundle the files, but wasn't able to find how. Can you please let me know how I can bundle the files together? Any resources or guide would be greatly appreciated.

Also, the zip download is not working. For some reason, only longitudinal_01 and longitudinal_02 are downloading through zip (20K files out of 750K files). This is the command I ran in Command Prompt:

```
synapse get -r syn39711400
```

Currently, my download has completely stopped. No new files are being added to the directory, and the Command Prompt window is stuck at "downloaded 3.7GB (711.7kB/s)". But if I look at Task Manager, my ethernet is still receiving 16 Mbps (this is the only thing I am running).
@Chandra.Suda, there is inherent overhead in downloading a file, so downloading large numbers of small files (as opposed to small numbers of large files) will naturally be slow. From what you say, the average file size is about 47KB. A better approach is to bundle the small files together into a small number of large (say, 10MB or 100MB) files. Download from Synapse will then be much faster.
Hi @thomas.yu and @brucehoff, I am also receiving the same error now. It is very difficult to download the dataset, and I need to complete the download soon. This is the command I ran in Command Prompt:

```
synapse get -r synXXXXXXXX
```

This is the error I received:

```
[WARNING] Retrying in 16 seconds.
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
```

The error repeats and the download stops. There are 750K files (33GB). The download stops at around 500MB, but after some time (5-10 minutes) it starts downloading again. This is making the process extremely slow, and I need to finish my model soon. I have also emailed you. Please let me know how to fix this. Thank you for your time and help.
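In case it helps as a stopgap while the rate limiting is investigated, one option is to throttle on the client side so requests stay under the 240-per-60-seconds limit. A rough sketch with the Synapse Python client follows; the folder ID is the one from this thread, and the download directory and sleep interval are only guesses to be tuned:

```python
import time
import synapseclient
import synapseutils

syn = synapseclient.login()   # assumes cached credentials

PARENT = "syn39711400"        # folder/project ID mentioned in this thread
DOWNLOAD_DIR = "./coda_tb"    # placeholder local directory
DELAY = 0.5                   # seconds between files; each get() may issue more than
                              # one API call, so tune this to stay under 240 req / 60 s

# synapseutils.walk yields (folder, subfolders, files) tuples,
# where each file entry is a (name, synId) pair
for _folder, _subfolders, files in synapseutils.walk(syn, PARENT):
    for _name, syn_id in files:
        syn.get(syn_id, downloadLocation=DOWNLOAD_DIR)
        time.sleep(DELAY)     # crude client-side rate limiting
```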
Hi @giorgioquer, Sure, please email me at thomas.yu@synapse.org with the email that you have linked to your synapse account. Thanks, Tom
Hi @thomas.yu, I'm running `synapse -u giorgioquer get -r synNNNNNN` (can I share the project ID via email, as I'm not sure whether this thread is fully public?).
Hi @giorgioquer Apologies for the inconveniences. Could you provide me with the command you ran? It would also help me immensely if you provided me with download access to the folder or project you are downloading so I can replicate this issue. Thanks, Tom
I tried restarting from scratch, waiting 24h since the last activity. It downloads 6 files, then is blocked again:

```
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Retrying in 16 seconds
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Retrying in 16 seconds
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
```

If possible let's follow up on Monday; please let me know if there is something I should try on my end.
@thomas.yu To better answer your question, I'm not running anything in parallel. I ran the command once, then needed to stop it from the command line, and then ran it again. Not sure what happened, but since then I have been getting that error.
I stopped it from the command line and then restarted it; not sure if this is causing the issue.
Hi @giorgioquer , Are you just calling the `synapse` command once or are you running that command in parallel? Best, Tom
@thomas.yu : https://sagebionetworks.jira.com/browse/SYNPY-1201
Hi, I'm the researcher @hfink was referring to. I'm trying to download the dataset via the command line: `synapse -u MYUSERNAME get -r synNNNNN`. It worked fine (for about 20 minutes) but I needed to stop it from the command line. I restarted it soon after, but I got the aforementioned problem (and the download was not resuming).
@hfink provided additional information by email to the Synapse help email:

```
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Requests are too frequent for API call: /entity/#/bundle2. Allowed 240 requests every 60 seconds....
[WARNING] Retrying in 16 seconds
[WARNING] Too many concurrent requests. Allowed 3 concurrent connections at any time....
```

@hfink, can you provide the script or command that your collaborator is running to download the data? Also, can you say more about the problem? For example, is the problem that the download stops, or that it is too slow? (Tagging @KevinBoske and @thomas.yu to watch this thread.)
Can you say more about the "API limits" you mention? Specifically what is the symptom the researcher experiences?

Download Limits reached - [WARNING] Requests are too frequent for API call: allow 240 requests per 60 seconds