I think I misunderstood how TPUs work. It may be a silly mistake, but I'll write down what went wrong anyway. With PyTorch XLA, a single process can only see one device (very different from NVIDIA GPUs). So I followed the example script (linked below) to spawn 8 processes, each accessing one of the devices, and now it works well and runs very fast! I still haven't found a good way to visualize TPU usage, though; the TensorBoard trace viewer is not very intuitive... https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py
Not sure what the issue is. Can you restart the TPU-VM? (Go to the console, stop it, and then start it again.)
I have a naive question: for a v3-8, how many devices should I see? I tried pytorch-lightning yesterday and TensorBoard showed 8 devices, but I ran into performance issues with pytorch-lightning on TPU. Today I rewrote the script in native PyTorch and found that the current tpu-vm reports only 1 device (as returned by torch_xla.core.xla_model.xrt_world_size()). The same Lightning script now also fails, complaining that there is only one device, so I no longer know how many devices I should expect. Screenshot of the TensorBoard profile from yesterday's Lightning training: https://www.dropbox.com/s/cbop07pktjdfqfp/Screen%20Shot%202022-06-16%20at%206.27.43%20PM.png?dl=0 Screenshot of today's Lightning training failing on 8 cores, with only 1 device reported by torch_xla.core.xla_model.xrt_world_size(): https://www.dropbox.com/s/an86gpx4hox1qn7/Screen%20Shot%202022-06-16%20at%206.33.44%20PM.png?dl=0
@muntakimrafi I tried that and it didn't work. I finally figured out that the cause is that I'm not using the default network, so additional firewall configuration is needed to reach the SSH port. The command I used is: ``` gcloud compute firewall-rules create --network=network-name allow-ssh --allow=tcp:22 ``` Anyway, thanks for the help!
Have you tried [this](https://stackoverflow.com/questions/26193535/error-gcloud-compute-ssh-usr-bin-ssh-exited-with-return-code-255)? @Jianyu
I'm not sure whether it's because no TPU is available or something else, but every time I try to connect to the worker the connection times out. Does anyone have an idea how to deal with this? ``` $ gcloud alpha compute tpus tpu-vm ssh test --project ${PROJECT_ID} --zone ${ZONE} --ssh-flag="-4 -L 9009:localhost:9009" SSH: Attempting to connect to worker 0... ssh: connect to host port 22: Connection timed out Retrying: SSH command error: [/usr/bin/ssh] exited with return code [255]. ```
We trained a PyTorch model on the TRC and were quite happy with the experience. However, the training speed was abysmal when we switched to PyTorch Lightning. It might be related to [this bug](https://github.com/PyTorchLightning/pytorch-lightning/issues/13088). Has anyone found a workaround? A minor inconvenience was that, as far as I could tell, there was no way to create an "image" for the TPU that contained our code and environment. So every time we instantiated a VM we needed to copy GitHub credentials, clone the repo, install dependencies, set up env variables, etc.
Pretty positive, though the v3-8 is in high demand and it is sometimes hard to get one up.
@Xiaoting.Chen How is your experience with TRC so far?
I don't think this can be done. If you need more TPU cores, they would want you to request v3-16, v3-64, and so on.
Is there a way to "chain" all the TPU VMs in one region, so users don't have to copy their files/settings to each individual TPU and its attached disk, and can potentially get more computing power? It looks like this could be done by properly setting up tf.distribute.cluster_resolver.TPUClusterResolver() and/or other cloud settings, but there is no solid example of how the "gRPC address" should be configured, or of the other necessary steps in such a scenario.

