Troubleshooting

Import TensorFlow failed during installation

  1. Is TensorFlow installed?

If you see the error message below, it means that TensorFlow is not installed. Please install TensorFlow before installing Horovod.

error: import tensorflow failed, is it installed?

Traceback (most recent call last):
  File "/tmp/pip-OfE_YX-build/setup.py", line 29, in fully_define_extension
    import tensorflow as tf
ImportError: No module named tensorflow
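
A quick way to confirm the fix is to install TensorFlow and verify that it imports cleanly. This is a generic sketch; pin whatever TensorFlow version your project actually requires:

$ pip install tensorflow
$ python -c "import tensorflow as tf; print(tf.__version__)"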

  2. Are the CUDA libraries available?

If you see the error message below, it means that TensorFlow cannot be loaded. If you are installing Horovod into a container on a machine without GPUs, you may use CUDA stub drivers to work around the issue.

error: import tensorflow failed, is it installed?

Traceback (most recent call last):
  File "/tmp/pip-41aCq9-build/setup.py", line 29, in fully_define_extension
    import tensorflow as tf
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

To use CUDA stub drivers:

# temporarily add stub drivers to ld.so.cache
$ ldconfig /usr/local/cuda/lib64/stubs

# install Horovod, add other HOROVOD_* environment variables as necessary
$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod

# revert to standard libraries
$ ldconfig
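
To confirm that the stub library is now visible to the dynamic loader before running pip, you can query the loader cache (an optional sanity check, not required by the installer):

$ ldconfig -p | grep libcuda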

MPI not found during installation

  1. Is MPI in PATH?

If you see the error message below, it means that mpicxx was not found in PATH. Typically mpicxx is located in the same directory as mpirun. Please add the directory containing mpicxx to PATH before installing Horovod.

error: mpicxx -show failed, is mpicxx in $PATH?

Traceback (most recent call last):
  File "/tmp/pip-dQ6A7a-build/setup.py", line 70, in get_mpi_flags
    ['mpicxx', '-show'], universal_newlines=True).strip()
  File "/usr/lib/python2.7/subprocess.py", line 566, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

To use a custom MPI directory:

$ export PATH=$PATH:/path/to/mpi/bin
$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod

  2. Are the MPI libraries added to $LD_LIBRARY_PATH or ld.so.conf?

If you see the error message below, it means that mpicxx was unable to load some of the MPI libraries. If you recently installed MPI, make sure that the path to the MPI libraries is present in the $LD_LIBRARY_PATH environment variable, or in the /etc/ld.so.conf file.

mpicxx: error while loading shared libraries: libopen-pal.so.40: cannot open shared object file: No such file or directory
error: mpicxx -show failed (see error below), is MPI in $PATH?

Traceback (most recent call last):
  File "/tmp/pip-build-wrtVwH/horovod/setup.py", line 107, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['mpicxx', '-show']' returned non-zero exit status 127

If you have installed MPI in a user directory, you can add the MPI library directory to $LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/mpi/lib

If you have installed MPI in a non-standard system location (i.e. not /usr or /usr/local), you should add it to the /etc/ld.so.conf file:

$ echo /path/to/mpi/lib | sudo tee -a /etc/ld.so.conf

Additionally, if you have installed MPI in a system location, you should run sudo ldconfig after the installation to register the libraries in the cache:

$ sudo ldconfig

Error during installation: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]

If you see the error message below, it means that your MPI is likely outdated. We recommend installing Open MPI >= 4.0.0.

Note: Prior to installing a new version of Open MPI, don't forget to remove your existing MPI installation.

horovod/tensorflow/mpi_ops.cc: In function ‘void horovod::tensorflow::{anonymous}::PerformOperation(horovod::tensorflow::{anonymous}::TensorTable&, horovod::tensorflow::MPIResponse)’:
horovod/tensorflow/mpi_ops.cc:802:79: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
                                  recvcounts, displcmnts, dtype, MPI_COMM_WORLD);
                                                                               ^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:633:5: error:   initializing argument 1 of ‘int MPI_Allgatherv(void*, int, MPI_Datatype, void*, int*, int*, MPI_Datatype, MPI_Comm)’ [-fpermissive]
 int MPI_Allgatherv(void* , int, MPI_Datatype, void*, int *, int *, MPI_Datatype, MPI_Comm);
     ^
horovod/tensorflow/mpi_ops.cc:1102:45: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
                               MPI_COMM_WORLD))
                                             ^
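
Before upgrading, you can check which MPI implementation and version is currently active (both commands ship with every major MPI distribution):

$ mpirun --version
$ mpicxx -show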

Error during installation: fatal error: pyconfig.h: No such file or directory

If you see the error message below, it means that you need to install Python headers.

build/horovod/torch/mpi_lib/_mpi_lib.c:22:24: fatal error: pyconfig.h: No such file or directory
 #  include <pyconfig.h>
                        ^
compilation terminated.

You can do this by installing a python-dev or python3-dev package. For example, on a Debian or Ubuntu system:

$ sudo apt-get install python-dev
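
If you are building against Python 3, install the corresponding header package instead (package name per Debian/Ubuntu conventions):

$ sudo apt-get install python3-dev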

NCCL 2 not found during installation

If you see the error message below, it means that NCCL 2 was not found in the standard library locations. If you have a directory where NCCL 2 is installed, containing both an include directory with nccl.h and a lib directory with libnccl.so, you can pass it via the HOROVOD_NCCL_HOME environment variable. Otherwise you can specify them separately via the HOROVOD_NCCL_INCLUDE and HOROVOD_NCCL_LIB environment variables.

build/temp.linux-x86_64-2.7/test_compile/test_nccl.cc:1:18: fatal error: nccl.h: No such file or directory
 #include <nccl.h>
                  ^
compilation terminated.
error: NCCL 2.0 library or its later version was not found (see error above).
Please specify correct NCCL location via HOROVOD_NCCL_HOME environment variable or combination of HOROVOD_NCCL_INCLUDE and HOROVOD_NCCL_LIB environment variables.

HOROVOD_NCCL_HOME - path where NCCL include and lib directories can be found
HOROVOD_NCCL_INCLUDE - path to NCCL include directory
HOROVOD_NCCL_LIB - path to NCCL lib directory

For example:

$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod

Or:

$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_INCLUDE=/path/to/nccl/include HOROVOD_NCCL_LIB=/path/to/nccl/lib pip install --no-cache-dir horovod

Pip install: no such option: --no-cache-dir

If you see the error message below, it means that your version of pip is out of date. You can remove the --no-cache-dir flag, since your version of pip does not do caching. The --no-cache-dir flag is added to all examples to ensure that when you change the Horovod compilation flags, it is rebuilt from source rather than simply re-installed from the pip cache, which is the default behavior of modern pip.

$ pip install --no-cache-dir horovod

Usage:
  pip install [options] <requirement specifier> ...
  pip install [options] -r <requirements file> ...
  pip install [options] [-e] <vcs project url> ...
  pip install [options] [-e] <local project path> ...
  pip install [options] <archive url/path> ...

no such option: --no-cache-dir

For example:

$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install horovod
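
Alternatively, upgrading pip itself also makes the --no-cache-dir flag available (standard pip usage):

$ pip install --upgrade pip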

ncclAllReduce failed: invalid data type

If you see the error message below during training, it means that Horovod was linked to the wrong version of the NCCL library.

UnknownError (see above for traceback): ncclAllReduce failed: invalid data type
         [[Node: DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_2_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/AddN_2)]]
         [[Node: train_op/_653 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1601_train_op", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

If you are using Anaconda or Miniconda, you most likely have an nccl package installed. The solution is to remove the package and reinstall Horovod:

$ conda remove nccl
$ pip uninstall -y horovod
$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
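
If you want to verify which NCCL library Horovod actually linked against after reinstalling, one option is to inspect its native extension with ldd. The path below is a placeholder; the exact file name and location depend on your Python environment and Horovod version:

$ ldd /path/to/site-packages/horovod/tensorflow/mpi_lib*.so | grep nccl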

transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error

If you see the error message below during training with -x NCCL_DEBUG=INFO, it most likely means that multiple servers share the same hostname.

node1:22671:22795 [1] transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error

MPI and NCCL rely on hostnames to distinguish between servers, so you should make sure that every server has a unique hostname.
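
A quick way to audit this is to print each machine's hostname and, where necessary, assign a unique one. hostnamectl is available on systemd-based distributions, and the name node1 below is only an example:

# run on every server; each machine should report a distinct name
$ hostname

# assign a unique hostname on systemd-based systems
$ sudo hostnamectl set-hostname node1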

Running out of memory

If you notice that your program is running out of GPU memory and multiple processes are being placed on the same GPU, it is likely that your program (or one of its dependencies) creates a tf.Session that does not use the config that pins a specific GPU.

If possible, track down the part of your program that uses these additional tf.Sessions and pass the same configuration to them.

Alternatively, you can place the following snippet at the beginning of your program to ask TensorFlow to minimize the amount of memory it pre-allocates on each GPU:

# Allocate GPU memory on demand rather than pre-allocating it up front.
small_cfg = tf.ConfigProto()
small_cfg.gpu_options.allow_growth = True

# Opening (and immediately closing) a session initializes the GPUs
# with this conservative configuration.
with tf.Session(config=small_cfg):
    pass

As a last resort, you can replace setting config.gpu_options.visible_device_list with different code:

# Pin the GPU to be used; this must run before TensorFlow initializes CUDA
import os
os.environ['CUDA_VISIBLE_DEVICES'] = str(hvd.local_rank())

Note: Setting CUDA_VISIBLE_DEVICES is incompatible with config.gpu_options.visible_device_list.

Setting CUDA_VISIBLE_DEVICES has an additional disadvantage for the GPU version: CUDA will not be able to use IPC, which will likely cause NCCL and MPI to fail. To disable IPC in NCCL and MPI and allow them to fall back to shared memory, use:

* export NCCL_P2P_DISABLE=1 for NCCL.
* the --mca btl_smcuda_use_cuda_ipc 0 flag for Open MPI, and similar flags for other vendors.

libcudart.so.X.Y: cannot open shared object file: No such file or directory

If you notice that your program crashes with a libcudart.so.X.Y: cannot open shared object file: No such file or directory error, it is most likely that your framework and Horovod were built with different versions of CUDA.

To build Horovod with a specific CUDA version, use the HOROVOD_CUDA_HOME environment variable during installation:

$ pip uninstall -y horovod
$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_HOME=/path/to/cuda pip install --no-cache-dir horovod
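
To see which CUDA runtime libraries the dynamic loader can currently resolve (an optional diagnostic to compare against the version your framework expects), you can query the loader cache:

$ ldconfig -p | grep libcudart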

FORCE-TERMINATE AT Data unpack would read past end of buffer

If you see the error message below during training, it most likely means that you have the wrong version of hwloc installed on your system.

--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[25215,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------
[future5.stanford.edu:12508] [[25215,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355

Purge hwloc from your system:

$ apt purge hwloc-nox libhwloc-dev libhwloc-plugins libhwloc5

After purging hwloc, reinstall Open MPI.

See this issue for more details.

Segfault with TensorFlow 1.14 or higher mentioning hwloc

If you are using TensorFlow 1.14 or 1.15 and are seeing a segfault, check whether it mentions hwloc:

...
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x99
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f309d34ff20]
[ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_hwloc_base_free_topology+0x76)[0x7f3042871ca6]
...

If it does, this may be a conflict with the hwloc symbols exported by TensorFlow.

To resolve the issue, locate your hwloc library with ldconfig -p | grep libhwloc.so, and then set LD_PRELOAD. For example:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhwloc.so python -c 'import horovod.tensorflow as hvd; hvd.init()'

See this issue for more information: https://github.com/horovod/horovod/issues/1123

bash: orted: command not found

If you see the error message below during training, it most likely means that Open MPI cannot find one of its components in PATH.

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

We recommend reinstalling Open MPI with the --enable-orterun-prefix-by-default flag:

$ wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.4.tar.gz
$ tar zxf openmpi-4.1.4.tar.gz
$ cd openmpi-4.1.4
$ ./configure --enable-orterun-prefix-by-default
$ make -j $(nproc) all
$ make install
$ ldconfig
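
After reinstalling, a quick sanity check on each node confirms that the daemon and the launcher both resolve from PATH:

$ which orted
$ mpirun --version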