If you have started using Cloud TPUs, could you leave feedback for the TRC team by commenting on this thread? Thanks.

Created by Abdul Muntakim Rafi (muntakimrafi)
I think I misunderstood how TPUs work; it might be a stupid mistake, but I still decided to write down why. With PyTorch XLA, a single process can only see one device (very different from NVIDIA GPUs). So I followed the example script (linked below) to spawn 8 processes, each accessing one of the devices, and now it works quite well and very fast! I still haven't found a good way to visualize TPU usage, though; the TensorBoard trace viewer is not very intuitive... https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py
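For reference, a minimal sketch of that spawn pattern, condensed from the linked script (it assumes torch_xla is installed on a v3-8 TPU VM; the tiny model is just a placeholder):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process sees exactly one TPU core via xm.xla_device().
    device = xm.xla_device()
    model = torch.nn.Linear(10, 10).to(device)
    out = model(torch.randn(4, 10).to(device))
    xm.mark_step()  # materialize the lazily built XLA graph
    print(f'process {index} ran a step on {device}')

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)  # one process per v3-8 core
```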
Not sure what the issue is. Can you restart the TPU VM? (Go to the console, stop it, and then start it again.)
I have a naive question: for a v3-8, how many devices should I expect? I tried PyTorch Lightning yesterday and TensorBoard seemed to show 8 devices, but then I ran into a performance issue with Lightning on TPU. I rewrote the script in native PyTorch today and found that on the current TPU VM I only have 1 device (as returned by torch_xla.core.xla_model.xrt_world_size()). The same Lightning script now also fails, complaining there is only one device, so I'm confused about how many devices I should expect.

Screenshot of the TensorBoard profile from yesterday's Lightning training: https://www.dropbox.com/s/cbop07pktjdfqfp/Screen%20Shot%202022-06-16%20at%206.27.43%20PM.png?dl=0

Today I only get 1 device according to torch_xla.core.xla_model.xrt_world_size(), and the same Lightning training fails on 8 cores: https://www.dropbox.com/s/an86gpx4hox1qn7/Screen%20Shot%202022-06-16%20at%206.33.44%20PM.png?dl=0
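A minimal way to check what a single (non-spawned) process sees, assuming torch_xla is installed on the TPU VM:

```python
import torch_xla.core.xla_model as xm

# Outside of xmp.spawn, the world size is 1 even on a v3-8.
print(xm.xrt_world_size())
# All TPU devices the runtime can see from this process (8 on a v3-8).
print(xm.get_xla_supported_devices('TPU'))
```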
@muntakimrafi I tried that and it didn't work. I finally figured out that it was because I'm not using the default network, which needs an additional firewall rule to open the SSH port. The command I used:

```
gcloud compute firewall-rules create allow-ssh --network=network-name --allow=tcp:22
```

Anyway, thanks for the help.
Have you tried [this](https://stackoverflow.com/questions/26193535/error-gcloud-compute-ssh-usr-bin-ssh-exited-with-return-code-255)? @Jianyu
I'm not sure if it's because there is no TPU available or something else; every time I try to connect to the worker, it says the connection timed out. Does anyone have an idea how to deal with this?

```
$ gcloud alpha compute tpus tpu-vm ssh test --project ${PROJECT_ID} --zone ${ZONE} --ssh-flag="-4 -L 9009:localhost:9009"
SSH: Attempting to connect to worker 0...
ssh: connect to host 34.91.66.234 port 22: Connection timed out
Retrying: SSH command error: [/usr/bin/ssh] exited with return code [255].
```
We trained a PyTorch model on the TRC and were quite happy with the experience. However, the training speed was abysmal when we switched to PyTorch Lightning. It might be related to [this bug](https://github.com/PyTorchLightning/pytorch-lightning/issues/13088). Has anyone found a way around this?

A minor inconvenience was that, as far as I could tell, there was no way to create an "image" of the TPU VM that contained our code and environment. So every time we instantiated a VM we needed to copy GitHub credentials, clone the repo, install dependencies, set up env variables, etc.
Pretty positive, though v3-8s are in high demand and it's sometimes hard to get one up.
@Xiaoting.Chen How is your experience with TRC so far?
I don't think this can be done. If you need more TPU cores, they want you to request a v3-16, v3-64, and so on.
Is there a way to "chain" all the TPU VMs in one region, so users don't have to copy their files/settings to each individual TPU and its attached disk, and can potentially get more computing power? It looks like this could be done by properly setting up the tf.distribute.cluster_resolver.TPUClusterResolver() function and/or other cloud settings, but there is no solid example of how the "gRPC address" should be configured, or of the other necessary steps in such scenarios.
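For the single-TPU case at least, the resolver can fill in the gRPC address from the TPU name, so it rarely needs to be hard-coded. A hedged sketch, with 'my-tpu' as a placeholder TPU name:

```python
import tensorflow as tf

# Resolve the TPU by name; the resolver looks up the gRPC endpoint itself.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

print('gRPC endpoint:', resolver.get_master())     # e.g. grpc://10.x.x.x:8470
print('replicas:', strategy.num_replicas_in_sync)  # 8 on a v3-8
```

Whether independent v3-8 VMs can be chained into one cluster this way is a separate question.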
