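Troubleshooting (AEN 4.1.2)

Overview

This is a troubleshooting guide for Anaconda Enterprise Notebooks deployments.

Normal operation

Server

Anaconda Enterprise Notebooks Server is installed in /opt/wakari/wakari-server

You can get the status of the server processes with: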

# service wakari-server status
wk-server                        RUNNING    pid 20758, uptime 5 days, 0:30:23
worker                           RUNNING    pid 20757, uptime 5 days, 0:30:23

Or:

root@server # ps -Hu wakari
  PID TTY          TIME CMD
20756 ?        00:02:26 .supervisord
20757 ?        00:05:58   mtq-worker
20758 ?        00:00:08   wk-server
20765 ?        00:02:00     wk-server
20766 ?        00:01:55     wk-server
20767 ?        00:02:20     wk-server
20770 ?        00:02:02     wk-server

supervisord details

  description    Manages wakari-worker and multiple processes of wk-server
  user           wakari
  configuration  /opt/wakari/wakari-server/etc/supervisord.conf
  log            /opt/wakari/wakari-server/var/log/supervisord.log
  control        service wakari-server
  ports          none

wk-server details

  description    Handles user interaction and passing jobs on to the wakari gateway. Access to it is managed by nginx.
  user           wakari
  command        /opt/wakari/wakari-server/bin/wk-server
  configuration  /opt/wakari/wakari-server/etc/wakari/
  control        service wakari-server
  logs           /opt/wakari/wakari-server/var/log/wakari/server.log
  ports          5000 (only on localhost)

wakari-worker details

  description    Asynchronously executes tasks from wk-server
  user           wakari
  logs           /opt/wakari/wakari-server/var/log/wakari/worker.log
  control        service wakari-server

nginx details

  description    Serves static files and acts as proxy for all other requests which are passed to wk-server process running on port 5000.
  user           nginx
  configuration  /etc/nginx/nginx.conf
                 /opt/wakari/wakari-server/etc/conf.d/www.enterprise.conf
  logs           /var/log/nginx/woc.log
                 /var/log/nginx/woc-error.log
  control        service nginx status
  port           80

Nginx runs at least two processes:

  • A master process running as the root user
  • Worker processes running as the nginx user

Gateway

Anaconda Enterprise Notebooks Gateway is installed in /opt/wakari/wakari-gateway

You can get the status of the gateway process with:

# service wakari-gateway status
wk-gateway                       RUNNING    pid 1137, uptime 5 days, 1:59:28

Or:

root@gateway # ps -Hu wakari
  PID TTY          TIME CMD
 1136 ?        00:01:59 .supervisord
 1137 ?        00:00:02   wk-gateway

supervisord details

  description    Manages the wk-gateway process.
  user           wakari
  configuration  /opt/wakari/wakari-gateway/etc/supervisord.conf
  log            /opt/wakari/wakari-gateway/var/log/supervisord.log
  control        service wakari-gateway
  ports          none

wakari-gateway details

  description    Passes requests from Anaconda Enterprise Notebooks Server to the Compute Nodes.
  user           wakari
  configuration  /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
  logs           /opt/wakari/wakari-gateway/var/log/wakari/gateway.application.log
                 /opt/wakari/wakari-gateway/var/log/wakari/gateway.log
  working dir    / (root)
  port           8089 (webcache)

Compute node

Anaconda Enterprise Notebooks Compute is installed in /opt/wakari/wakari-compute

You can get the status of the compute node process with:

# service wakari-compute status
wk-compute                       RUNNING    pid 22050, uptime 3 days, 1:03:19

Or:

root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:02:01 .supervisord
 1152 ?        00:00:01   wk-compute

wk-compute loads these configuration files in order:

  • /etc/wakari/config.json
  • /etc/wakari/compute-launcher-config.json
  • ./compute-launcher-config.json
  • Config file specified by -c option

If an option is specified in more than one file, the last one encountered takes precedence.
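The precedence rule above can be checked from the shell. The following is a minimal sketch, not something shipped with AEN. The helper name last_defining_file is hypothetical, and it only searches for the quoted key name rather than parsing JSON, so treat its answer as a hint rather than proof.

```shell
# Hypothetical helper (not part of AEN): print the last readable file,
# in the order given, that contains the quoted key. Because the last file
# loaded wins, that file is the likely source of the effective value.
last_defining_file() {
    key=$1
    shift
    winner=""
    for f in "$@"; do
        # Skip files that are absent or unreadable on this node.
        [ -r "$f" ] || continue
        if grep -q "\"$key\"" "$f"; then
            winner=$f
        fi
    done
    if [ -n "$winner" ]; then
        printf '%s\n' "$winner"
    else
        return 1
    fi
}

# Example call, listing the files in the documented load order:
# last_defining_file projectRoot /etc/wakari/config.json \
#     /etc/wakari/compute-launcher-config.json ./compute-launcher-config.json
```

Pass the files in the same order that wk-compute loads them; a file given with the -c option would go last.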

supervisord details

  description    Manages the wk-compute process.
  user           wakari
  configuration  /opt/wakari/wakari-compute/etc/supervisord.conf
  log            /opt/wakari/wakari-compute/var/log/supervisord.log
  control        service wakari-compute
  working dir    /opt/wakari/wakari-compute/etc
  ports          none

wk-compute details

  description    Launches compute processes
  user           wakari
  configuration  /opt/wakari/wakari-compute/etc/wakari/wk-compute-launcher-config.json
                 /opt/wakari/wakari-compute/etc/wakari/scripts/config.json
  logs           /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.application.log
                 /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.log
  working dir    / (root)
  control        service wakari-compute
  port           5002 (rfe)

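Projects and permissions

Projects live in the projectRoot folder on the compute node (/projects by default). A project directory is created the first time the project is started; the start-project script clones it from /opt/wakari/wakari-compute/lib/node_modules/wakari-compute-launcher/skeleton.

Project directory permissions are as follows: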

owner: rwx, user who created the project
group: rwx, owner's group
other: --x, to allow access to the Public folder
ACL:   rwx for any other team members
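
To compare an existing project directory against these expected permissions, a sketch like the following may help. It assumes GNU stat, as found on RHEL and CentOS. The helper name perm_triads and the example project path are hypothetical.

```shell
# Hypothetical helper: print the owner, group and "other" permission triads
# of a file or directory, one per line, using GNU stat's symbolic mode (%A).
perm_triads() {
    mode=$(stat -c '%A' "$1") || return 1
    printf 'owner: %s\n' "$(printf '%s' "$mode" | cut -c2-4)"
    printf 'group: %s\n' "$(printf '%s' "$mode" | cut -c5-7)"
    printf 'other: %s\n' "$(printf '%s' "$mode" | cut -c8-10)"
}

# Example (the path is illustrative only):
# perm_triads /projects/alice/demo
# getfacl /projects/alice/demo    # lists the ACL entries for team members
```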

Files and subdirectories in the project directory have the same permissions as the project directory, except:

  1. The public folder and everything in it are world readable.
  2. Any files hardlinked into the root anaconda environment (/opt/wakari/anaconda) remain owned by the root or wakari users.

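Project file and directory permissions are maintained by the start-project script. All files and directories in a project have their permissions set when the project starts, except for files owned by root or by the AEN_SRVC_ACCT user (usually wakari or aen_admin). Files owned by root or the AEN_SRVC_ACCT user are left unchanged so that the permissions of files linked from /opt/wakari/anaconda are not altered.

Note: Do not start a project as the AEN_SRVC_ACCT user (usually wakari or aen_admin). The permission system cannot correctly manage project files owned by that user.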

General troubleshooting steps

Make sure the Anaconda Enterprise Notebooks services are set to start at boot

(On all 3 components: server, gateway and compute node)

chkconfig --list | grep wakari

If any of them are missing, you can try adding them:

chkconfig --add [wakari-server|wakari-gateway|wakari-compute]

You can then safely start the services with the restart command, as follows:

service wakari-server restart
service wakari-gateway restart
service wakari-compute restart

Each command must be run on the node that hosts that component.

Make sure all services are running

(See Normal operation above.)

# service wakari-server status
wk-server                        RUNNING    pid 20758, uptime 5 days, 0:30:23
worker                           RUNNING    pid 20757, uptime 5 days, 0:30:23

root@server # service nginx status
nginx (pid  26303) is running...

# service wakari-gateway status
wk-gateway                       RUNNING    pid 1137, uptime 5 days, 1:59:28

# service wakari-compute status
wk-compute                       RUNNING    pid 22050, uptime 3 days, 1:03:19

If any process is missing, restart it using the commands above.
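
When checking several nodes, it can help to reduce the status output to the processes that need attention. The following is a sketch under the assumption that the status lines have the "NAME STATE details" layout shown above. The helper name not_running is hypothetical, and it does not apply to the differently formatted output of service nginx status.

```shell
# Hypothetical helper: read status lines on stdin ("NAME  STATE  details...")
# and print the names of processes whose state is anything but RUNNING.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}

# Example on the server node:
# service wakari-server status | not_running
```

Empty output means every listed process reported RUNNING.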

Check for extra processes

Use ps -Hu wakari to get a full list of the processes running under the wakari user account.

root@server # ps -Hu wakari
  PID TTY          TIME CMD
20756 ?        00:02:26 .supervisord
20757 ?        00:05:58   mtq-worker
20758 ?        00:00:08   wk-server
20765 ?        00:02:00     wk-server
20766 ?        00:01:55     wk-server
20767 ?        00:02:20     wk-server
20770 ?        00:02:02     wk-server

root@server # ps -f -C nginx
UID        PID  PPID  C STIME TTY          TIME CMD
root     26303     1  0 12:18 ?        00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    26305 26303  0 12:18 ?        00:00:00 nginx: worker process

root@gateway # ps -Hu wakari
  PID TTY          TIME CMD
 1136 ?        00:01:59 .supervisord
 1137 ?        00:00:02   wk-gateway

root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:02:01 .supervisord
 1152 ?        00:00:01   wk-compute

What is normal:

  • The wk-server, wk-gateway, and wk-compute processes should have the PIDs reported by supervisorctl.
  • The nginx master process should have the PID reported by service nginx status.
  • If you have installed more than one Anaconda Enterprise Notebooks component on a single machine, the processes from all of the installed components will show up on that machine.
  • On the Compute node, any Anaconda Enterprise Notebooks applications currently being run by users will be present. For example:
root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:00:00 .supervisord
 1152 ?        00:00:00   wk-compute
 1340 ?        00:00:00 bash
 1341 ?        00:00:00   notebookwrapper

If there are extra wk-server, wk-gateway, wk-compute or supervisord processes, remove them with the kill command. Then restart the service with service SERVICE_NAME restart, as described above.
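
One quick way to spot extras is to count processes by command name. This is a sketch only: the helper name count_procs is hypothetical, and it assumes the four-column ps layout shown above. For supervisord, a count higher than the number of AEN components installed on the machine suggests a stray process.

```shell
# Hypothetical helper: read `ps -Hu wakari` output on stdin and count the
# processes whose command name (4th column) ends in the given string.
count_procs() {
    awk -v name="$1" '
        NR > 1 && $4 ~ (name "$") { n++ }   # NR > 1 skips the header line
        END { print n + 0 }                 # n + 0 prints 0 when nothing matched
    '
}

# Example:
# ps -Hu wakari | count_procs supervisord
```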

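Check connectivity between servers

Server to Gateway

On the server, navigate to Admin/Data Centers. For each data center in the list, check connectivity from the server to that gateway (in this example, the gateway is http://gateway.example.com:8089):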

root@server # curl --connect-timeout 5 http://gateway.example.com:8089 > /dev/null

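Gateway to Compute

On the server, navigate to Admin/Enterprise Resources. For each compute resource in the list, open it and check that the URL field begins with "http" or "https". Then check connectivity to that URL from the corresponding gateway. For example, if the URL is http://compute.example.com:5002: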

root@gateway # curl --connect-timeout 5 http://compute.example.com:5002 > /dev/null

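Gateway to Server

This path is used by the gateway configuration command wk-gateway-configure. First, make sure that the gateway is linked to the correct server in the configuration file and that the full server URL is specified. Then check connectivity to the server.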

root@gateway # grep WAKARI_SERVER /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
  "WAKARI_SERVER": "http://wakari.example.com",

root@gateway # curl --connect-timeout 5 http://wakari.example.com > /dev/null
root@gateway # curl --connect-timeout 5 http://error.example.com > /dev/null
curl: (7) Failed to connect to error.example.com port 80: Connection refused
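
The two steps above (read the configured URL, then test it) can be combined. The following is a sketch, not an AEN tool: both helper names are hypothetical, and the sed expression assumes that the WAKARI_SERVER key and its value sit on a single line, as in the grep output above.

```shell
# Hypothetical helper: print the WAKARI_SERVER value from a gateway config
# file. Works on the raw text, so the key and value must be on one line.
server_url_from_config() {
    sed -n 's/.*"WAKARI_SERVER"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Hypothetical helper: report whether a URL accepts a connection within
# five seconds, using the same curl timeout as the commands above.
check_url() {
    if curl --connect-timeout 5 -s -o /dev/null "$1"; then
        echo "OK   $1"
    else
        echo "FAIL $1 (curl exit $?)"
    fi
}

# Example on the gateway node:
# check_url "$(server_url_from_config /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json)"
```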

If a connection fails, check the following:

  • Ensure that Gateways (Data Centers) and Compute nodes (Enterprise Resources) are correctly configured on the server.
  • Verify that processes are listening on the configured ports:
root@server # netstat -plt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 *:http                      *:*                         LISTEN      26409/nginx
tcp        0      0 *:ssh                       *:*                         LISTEN      986/sshd
tcp        0      0 localhost:smtp              *:*                         LISTEN      1063/master
tcp        0      0 *:complex-main             *:*                         LISTEN      26192/python
tcp        0      0 localhost:27017             *:*                         LISTEN      29261/mongod
tcp        0      0 *:ssh                       *:*                         LISTEN      986/sshd
tcp        0      0 localhost:smtp              *:*                         LISTEN      1063/master
  • Check firewall settings/logs on both hosts to ensure that packets are not being blocked or discarded.
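
To test for one specific port instead of reading the whole listing, a filter like the one below can be used. It is a sketch: the helper name is_listening is hypothetical, and it expects numeric addresses, so it adds netstat's -n flag to the command shown above (otherwise port 5000 appears under a service name such as complex-main).

```shell
# Hypothetical helper: read `netstat -plnt` output on stdin and succeed if
# some socket is in the LISTEN state on the given numeric port.
is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example on the server node:
# netstat -plnt | is_listening 5000 || echo "nothing is listening on port 5000"
```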

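Check configuration file syntax

Use these commands to verify that the configuration files contain valid JSON. python -m json.tool reads one input file per invocation and treats a second file name as the output file, so each file is checked in its own call:

root@server  # for f in /opt/wakari/wakari-server/etc/wakari/*.json; do python -m json.tool "$f"; done
root@gateway # for f in /opt/wakari/wakari-gateway/etc/wakari/*.json; do python -m json.tool "$f"; done
root@compute # for f in /opt/wakari/wakari-compute/etc/wakari/*.json; do python -m json.tool "$f"; done

If a file is valid, its contents are displayed. If a file contains a syntax error, the message No JSON object could be decoded is displayed. Edit the configuration file and correct the JSON syntax.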

Check file ownership

Verify that all files in /opt/wakari/anaconda belong to the wakari user and group:

root@server # find /opt/wakari/anaconda \! -user wakari -print
root@server # find /opt/wakari/anaconda \! -group wakari -print

If any files are listed in the output, fix their ownership:

chown -R wakari:wakari /opt/wakari/anaconda
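
The two find commands can also be merged into a single pass, which is convenient for re-checking after the chown. This is a sketch; the helper name not_owned_by is hypothetical.

```shell
# Hypothetical helper: list everything under a directory that is not owned
# by the given user or does not belong to the given group.
# Arguments: directory, user name or numeric ID, group name or numeric ID.
not_owned_by() {
    find "$1" \( \! -user "$2" -o \! -group "$3" \) -print
}

# Example on the server node (no output means ownership is consistent):
# not_owned_by /opt/wakari/anaconda wakari wakari
```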

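Verify that POSIX ACLs are enabled

The acl option must be enabled on the filesystem that contains the project root directory.

First, determine the project root directory. If a custom projectRoot is configured, you can find it with: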

root@compute # grep projectRoot /opt/wakari/wakari-compute/etc/wakari/config.json

Otherwise, the project root is /projects

Either the mount options or the default mount options listed by tune2fs should show that the acl option is enabled.

root@compute # fs=`df /projects | tail -1 | cut -d " " -f 1`
root@compute # mount | grep $fs
/dev/vda on / type ext4 (rw)
root@compute # tune2fs -l $fs | grep options
Default mount options:    user_xattr acl
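
The check can be scripted by searching both outputs for acl as a whole word. This is a sketch: the helper name has_acl_option is hypothetical, and $fs is the variable set in the commands above.

```shell
# Hypothetical helper: read mount or tune2fs output on stdin and succeed if
# "acl" appears as a complete option, so that "noacl" does not count.
has_acl_option() {
    grep -Eq '(^|[^[:alnum:]_])acl([^[:alnum:]_]|$)'
}

# Example on the compute node, after setting $fs as shown above:
# { mount | grep "$fs"; tune2fs -l "$fs" | grep options; } | has_acl_option \
#     || echo "acl is not enabled on $fs"
```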

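Clear browser cookies

When the Anaconda Enterprise Notebooks configuration changes or the software is upgraded, cookies left in the browser can cause problems. Clearing the cookies and logging in again can help resolve them.

Specific problems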

Problem:  Browser indicates “too many redirects”
Cause:    Cookies are out of date
Solution: Clear your browser’s cookies and cache, then try again.

Problem:  supervisorctl error: “unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file”
Cause:    “supervisord” is not running on the Server
Solution: Ensure that supervisord is included in the crontab, as described above. Then start supervisord manually.

Problem:  Data Center Not Found message when deleting a project
Cause:    Datacenter has already been removed
Solution: As root, run /opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project>

Problem:  Forgotten administrator password
Cause:    (none listed)
Solution: Use ssh to log in to the server as root, and run the command /opt/wakari/wakari-server/bin/wk-server-admin add-user wakari --admin -p <new password> -e <your email>. You can then log in to Anaconda Enterprise Notebooks as the wakari user with the new password you chose.

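Logs

The locations of the Anaconda Enterprise Notebooks log files for each process and application are shown in the tables above.

The Anaconda Enterprise Notebooks installers log to /tmp/wakari_{server,gateway,compute}.log.

If the log files grow too large, they can be deleted. To make the logs more or less verbose, the Jupyter Notebook system has a setting, Application.log_level. Setting Application.log_level to ERROR makes the logs terser than the default while keeping them fairly informative.

supervisord was killed and "Error: This socket is closed." appears

When the supervisor daemon supervisord is killed, the output sent to stdout and stderr is held in a pipe, which eventually fills up. After that, any attempt to start an application fails with the error message "This socket is closed".

To prevent this problem, always shut down and restart processes cleanly, and never shut down or kill supervisord without first shutting down wk-compute and the other processes that use it.

To recover from this problem, use sudo kill -9 to stop the wk-compute process. Then restart the supervisord and wk-compute processes: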

sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start

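Service Error 502: Cannot connect to the application manager

When the gateway node displays this error, the compute resource is not responding.

This error occurs when the wk-compute process has been shut down. To recover, restart the supervisord and wk-compute processes: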

sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start

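"502 Communication Error" on Amazon Web Services

If you see a page reading "502 Communication Error: This gateway could not communicate with the Wakari server" together with the IP address of the Wakari server, configure the AEN gateway to use the server's DNS hostname. On Amazon Web Services (AWS), this is the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.

Invalid username

The first character of a username must be a letter [a-z] or a digit [0-9].

Every other character in the username may be a letter [a-z], a digit [0-9], a period [.], an underscore [_] or a hyphen [-].

These characters belong to the POSIX portable filename character set, which portable usernames share.

Anaconda Enterprise Notebooks usernames must be at least 3 characters and no more than 25 characters long.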
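
The rules above can be expressed as a small shell check, for example to screen a list of names before creating accounts. This is a sketch, not the validation code AEN itself uses. The helper name valid_username is hypothetical, and it assumes that only lowercase letters are accepted, since the rules list [a-z] only.

```shell
# Hypothetical helper: succeed if the name follows the rules stated above:
# first character is a lowercase letter or digit, the remaining characters
# are lowercase letters, digits, ".", "_" or "-", and the length is 3 to 25.
valid_username() {
    # The alphabet is spelled out so that the result does not depend on how
    # the current locale interprets a range such as a-z.
    lower=abcdefghijklmnopqrstuvwxyz
    digits=0123456789
    case $1 in
        *[!$lower$digits._-]*) return 1 ;;   # contains a disallowed character
        [$lower$digits]*) ;;                 # first character is allowed
        *) return 1 ;;                       # empty, or bad first character
    esac
    [ "${#1}" -ge 3 ] && [ "${#1}" -le 25 ]
}

# Example:
# valid_username "j.smith" && echo "acceptable" || echo "invalid username"
```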