Hi everyone,
I have been struggling for quite a while with the following simplified code: it runs fine, without any errors, on my local machine, but it fails inside a container on a headless cluster managed with Slurm. I am using Ray 1.9.2, Python 3.8.12, and torch 1.11.
import ray
from torch.utils.data import Dataset


class MainDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.num_frames = 1

    def __len__(self):
        return 0

    def __getitem__(self, idx):
        return 0


@ray.remote
class RemoteMainDataset(MainDataset):
    def __init__(self):
        super().__init__()

    def get_num_frames(self):
        return self.num_frames


if __name__ == '__main__':
    ray.init(logging_level=30, local_mode=False, log_to_driver=False)
    dataset = RemoteMainDataset.remote()
    total_frames = ray.get(dataset.get_num_frames.remote())
    print(total_frames)
This code prints 1 on my local machine, but on the cluster, inside an enroot (or Docker) container, it produces the following error:
2022-02-09 10:44:40,790 WARNING utils.py:534 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 26, in <module>
    total_frames = ray.get(dataset.get_num_frames.remote())
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1715, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RemoteMainDataset.__init__() (pid=3217031, ip=192.168.33.210)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 18, in __init__
    super().__init__()
TypeError: super(type, obj): obj must be an instance or subtype of type
Even after I change super().__init__() to super(RemoteMainDataset, self).__init__(), so that the actor class reads:
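@ray.remote
class RemoteMainDataset(MainDataset):
    def __init__(self):
        # explicit two-argument form instead of the zero-argument super()
        super(RemoteMainDataset, self).__init__()

    def get_num_frames(self):
        return self.num_frames

I get another, different error: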
2022-02-09 10:51:43,057 WARNING utils.py:534 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 25, in <module>
    dataset = RemoteMainDataset.remote()
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 451, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 371, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 714, in _remote
    worker.function_actor_manager.export_actor_class(
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/function_manager.py", line 397, in export_actor_class
    serialized_actor_class = pickle.dumps(Class)
  File "/opt/conda/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
_pickle.PicklingError: Can't pickle <functools._lru_cache_wrapper object at 0x7fc86634b550>: it's not the same object as typing.Generic.__class_getitem__
One interesting observation is that I only get this error when subclassing torch.utils.data.Dataset. Subclassing torch.nn.Module, for instance, does not produce any errors and everything works fine.
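Roughly, an actor built the same way on top of nn.Module works without any issue (the class names below are just for illustration):

import ray
import torch.nn as nn


class MainModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_frames = 1


@ray.remote
class RemoteMainModule(MainModule):
    def __init__(self):
        super().__init__()

    def get_num_frames(self):
        return self.num_frames


if __name__ == '__main__':
    ray.init(logging_level=30, local_mode=False, log_to_driver=False)
    module = RemoteMainModule.remote()
    print(ray.get(module.get_num_frames.remote()))  # prints 1, no errors here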
It would be great if anyone who has run into similar issues, or has any idea what might be going on, could help me with this problem. Thank you.