If you have started using the cloud TPUs, can you leave feedback for the TRC team by commenting on this thread? Thanks.
Created by Abdul Muntakim Rafi (muntakimrafi)

I think I misunderstood how TPUs work; it might be a stupid mistake, but I'll still write down what happened. With PyTorch XLA, a single process can only see one device (which is very different from NVIDIA GPUs). So I followed the example script (linked below) to spawn 8 processes, each accessing one of the devices, and now it works quite well and very fast! I still haven't found a good way to visualize TPU usage, though; the TensorBoard trace viewer is not very intuitive...
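The core of the pattern, as I understand it, looks roughly like this (a minimal sketch with a placeholder train step, not the real MNIST loop; the full script is linked below):

```
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process gets exactly one TPU core as its XLA device.
    device = xm.xla_device()
    print(f"process {index}: device={device}, world_size={xm.xrt_world_size()}")
    # ... build the model and data loader here and run the training loop on `device` ...

if __name__ == '__main__':
    # On a v3-8, spawn one process per core (8 in total).
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```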
https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py

Not sure what the issue is. Can you restart the TPU-VM?
(Go to the console, stop it, and then start it again.)

I have a naive question: for a v3-8, how many devices should I have? I tried PyTorch Lightning yesterday and it seemed I had 8 devices shown in TensorBoard; then I ran into a performance issue with PyTorch Lightning on TPU. I rewrote the script in the native PyTorch way today, and found that on the current TPU VM I only have 1 device (as returned by torch_xla.core.xla_model.xrt_world_size()). The same Lightning script also failed, complaining that there is only one device, so I'm confused about how many devices I should expect.
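For context, this is roughly how I'm checking (a minimal sketch, run directly on the TPU VM; my understanding of the XRT runtime may well be off):

```
import torch_xla.core.xla_model as xm

# Outside xmp.spawn, in a plain single process, this reports 1
# even on a v3-8:
print(xm.xrt_world_size())

# The TPU devices the runtime exposes to this process:
print(xm.get_xla_supported_devices('TPU'))
```

Inside the processes created by xmp.spawn(..., nprocs=8), xrt_world_size() should report 8 instead.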
This is a screenshot of the TensorBoard profile from the Lightning training yesterday:
https://www.dropbox.com/s/cbop07pktjdfqfp/Screen%20Shot%202022-06-16%20at%206.27.43%20PM.png?dl=0
Today, however, I only got 1 device according to torch_xla.core.xla_model.xrt_world_size(), and the same Lightning training failed on 8 cores:
https://www.dropbox.com/s/an86gpx4hox1qn7/Screen%20Shot%202022-06-16%20at%206.33.44%20PM.png?dl=0

@muntakimrafi I tried that and it didn't work. I finally figured out that it was because I'm not using the default network, which needs an additional firewall rule to open the SSH port. The command I used is:
```
gcloud compute firewall-rules create allow-ssh --network=network-name --allow=tcp:22
```
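(If I understand the GCP defaults correctly, the auto-created default network ships with a default-allow-ssh ingress rule, while custom VPC networks start with no ingress rules at all, which is why the SSH connection just times out.)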
Anyway, thanks for the help!

Have you tried [this](https://stackoverflow.com/questions/26193535/error-gcloud-compute-ssh-usr-bin-ssh-exited-with-return-code-255)?

@Jianyu I'm not sure whether it's because there is no TPU available or something else. The thing is, every time I try to connect to the worker it keeps saying the connection timed out. Does anyone have an idea how to deal with this problem?
```
$ gcloud alpha compute tpus tpu-vm ssh test --project ${PROJECT_ID} --zone ${ZONE} --ssh-flag="-4 -L 9009:localhost:9009"
SSH: Attempting to connect to worker 0...
ssh: connect to host 34.91.66.234 port 22: Connection timed out
Retrying: SSH command error: [/usr/bin/ssh] exited with return code [255].
```
We trained a PyTorch model on the TRC and were quite happy with the experience. However, the training speed became abysmal when we switched to PyTorch Lightning. It might be related to [this bug](https://github.com/PyTorchLightning/pytorch-lightning/issues/13088). Has anyone gotten around this?
A minor inconvenience was that, as far as I could tell, there was no way to create an "image" for the TPU that contained our code and environment. So every time we instantiated a VM we needed to copy GitHub credentials, clone the repo, install dependencies, set up environment variables, etc.

Pretty positive overall, though v3-8s are in high demand and it is sometimes hard to get one up.
@Xiaoting.Chen How is your experience with TRC so far?

I think this cannot be done. If you need more TPU cores, they would rather have you request a v3-16, v3-64, and so on.

Is there a way to "chain" all the TPU VMs in one region, so users don't have to copy their files/settings to each individual TPU and its attached disk, and can potentially get more computing power? It looks like this could be done by properly setting up tf.distribute.cluster_resolver.TPUClusterResolver() and/or other cloud settings, but there is no solid example of how the "gRPC address" should be configured, or of the other steps needed in such a scenario.
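For the single-larger-slice route, my understanding from the docs is that TPUClusterResolver can look up the gRPC address itself given just the TPU name, so it would look something like this sketch ('my-tpu-pod' is a placeholder for a multi-host slice such as a v3-32):

```
import tensorflow as tf

# 'my-tpu-pod' is a placeholder name for a multi-host TPU slice
# (e.g. v3-32) created in the same project and zone.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu-pod')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Build the model here; variables are replicated across all cores.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
```

That said, this only applies to a single multi-host TPU; as far as I can tell there is indeed no supported way to chain separate v3-8 VMs together.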