Troubleshooting (AEN 4.1.2)#
Overview¶
This is a troubleshooting guide for Anaconda Enterprise Notebooks deployments.
Normal operation¶
Server¶
Anaconda Enterprise Notebooks Server is installed in /opt/wakari/wakari-server.
You can get the status of the server processes with:
# service wakari-server status
wk-server RUNNING pid 20758, uptime 5 days, 0:30:23
worker RUNNING pid 20757, uptime 5 days, 0:30:23
or:
root@server # ps -Hu wakari
PID TTY TIME CMD
20756 ? 00:02:26 .supervisord
20757 ? 00:05:58 mtq-worker
20758 ? 00:00:08 wk-server
20765 ? 00:02:00 wk-server
20766 ? 00:01:55 wk-server
20767 ? 00:02:20 wk-server
20770 ? 00:02:02 wk-server
supervisord | details |
---|---|
description | Manages wakari-worker and multiple processes of wk-server |
user | wakari |
configuration | /opt/wakari/wakari-server/etc/supervisord.conf |
log | /opt/wakari/wakari-server/var/log/supervisord.log |
control | service wakari-server |
ports | none |
wk-server | details |
---|---|
description | Handles user interaction and passing jobs on to the wakari gateway. Access to it is managed by nginx. |
user | wakari |
command | /opt/wakari/wakari-server/bin/wk-server |
configuration | /opt/wakari/wakari-server/etc/wakari/ |
control | service wakari-server |
logs | /opt/wakari/wakari-server/var/log/wakari/server.log |
ports | 5000 (only on localhost) |
wakari-worker | details |
---|---|
description | Asynchronously executes tasks from wk-server |
user | wakari |
logs | /opt/wakari/wakari-server/var/log/wakari/worker.log |
control | service wakari-server |
nginx | details |
---|---|
description | Serves static files and acts as proxy for all other requests which are passed to wk-server process running on port 5000. |
user | nginx |
configuration | /etc/nginx/nginx.conf, /opt/wakari/wakari-server/etc/conf.d/www.enterprise.conf |
logs | /var/log/nginx/woc.log /var/log/nginx/woc-error.log |
control | service nginx status |
port | 80 |
Nginx runs at least two processes:
- A master process running as the root user
- Worker processes running as the nginx user
Gateway¶
Anaconda Enterprise Notebooks Gateway is installed in /opt/wakari/wakari-gateway.
You can get the status of the gateway processes with:
# service wakari-gateway status
wk-gateway RUNNING pid 1137, uptime 5 days, 1:59:28
or:
root@gateway # ps -Hu wakari
PID TTY TIME CMD
1136 ? 00:01:59 .supervisord
1137 ? 00:00:02 wk-gateway
supervisord | details |
---|---|
description | Manages the wk-gateway process. |
user | wakari |
configuration | /opt/wakari/wakari-gateway/etc/supervisord.conf |
log | /opt/wakari/wakari-gateway/var/log/supervisord.log |
control | service wakari-gateway |
ports | none |
wakari-gateway | details |
---|---|
description | Passes requests from Anaconda Enterprise Notebooks Server to the Compute Nodes. |
user | wakari |
configuration | /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json |
logs | |
working dir | / (root) |
port | 8089 (webcache) |
Compute node¶
Anaconda Enterprise Notebooks Compute is installed in /opt/wakari/wakari-compute.
You can get the status of the compute node processes with:
# service wakari-compute status
wk-compute RUNNING pid 22050, uptime 3 days, 1:03:19
or:
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:02:01 .supervisord
1152 ? 00:00:01 wk-compute
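wk-compute loads these configuration files, in order:
- /etc/wakari/config.json
- /etc/wakari/compute-launcher-config.json
- ./compute-launcher-config.json
- Config file specified by the -c option
If an option is specified in more than one file, the last one encountered takes precedence.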
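When a setting does not seem to take effect, it can help to see which file in the load order is the last one to mention it. The following is a minimal sketch, not an AEN tool: winning_config is a hypothetical helper, and it assumes the option appears as a quoted key somewhere in the file. It does a plain text match rather than parsing JSON, so a nested key with the same name also matches.

```shell
# Hypothetical helper (not part of AEN): print the last file, in load order,
# that mentions the given option key. Files that do not exist are skipped.
# Assumption: the key appears as "key" in the file (text match, not a JSON parse).
winning_config() {
    key=$1
    shift
    winner=
    for f in "$@"; do
        if [ -f "$f" ] && grep -q "\"$key\"" "$f"; then
            winner=$f
        fi
    done
    if [ -n "$winner" ]; then
        printf '%s\n' "$winner"
    else
        return 1
    fi
}

# Example call on a compute node, using the paths from the list above:
#   winning_config projectRoot /etc/wakari/config.json \
#       /etc/wakari/compute-launcher-config.json ./compute-launcher-config.json
```

The files are passed in load order, so the last match is the one whose value takes precedence. A non-zero exit status means none of the files mention the key.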
supervisord | details |
---|---|
description | Manages the wk-compute process. |
user | wakari |
configuration | /opt/wakari/wakari-compute/etc/supervisord.conf |
log | /opt/wakari/wakari-compute/var/log/supervisord.log |
control | service wakari-compute |
working dir | /opt/wakari/wakari-compute/etc |
ports | none |
wk-compute | details |
---|---|
description | Launches compute processes |
user | wakari |
configuration | /opt/wakari/wakari-compute/etc/wakari/wk-compute-launcher-config.json, /opt/wakari/wakari-compute/etc/wakari/scripts/config.json |
logs | /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.application.log, /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.log |
working dir | / (root) |
control | service wakari-compute |
port | 5002 (rfe) |
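Projects and permissions¶
Projects are located in the projectRoot folder on the compute node (by default, /projects). A project directory is created the first time the project is started; the start-project script clones it from /opt/wakari/wakari-compute/lib/node_modules/wakari-compute-launcher/skeleton.
Project directory permissions are as follows: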
owner: rwx, user who created the project
group: rwx, owner's group
other: --x, to allow access to the Public folder
ACL: rwx for any other team members
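The owner, group, and other bits listed above correspond to octal mode 771. As a rough check, the sketch below compares a directory's mode string against that value. project_mode_ok and the example project path are hypothetical, and the check assumes no setgid or sticky bit is set on the directory.

```shell
# Hypothetical helper: succeed if the directory's mode bits are rwxrwx--x (octal 771).
# Only the first 10 characters of the ls mode string are compared, so a trailing
# "+" (ACL present) or "." (SELinux context) is ignored.
# Assumption: no setgid or sticky bit is set on the project directory.
project_mode_ok() {
    perms=$(ls -ld "$1" | cut -c1-10)
    [ "$perms" = "drwxrwx--x" ]
}

# Example on a compute node, with a hypothetical project path:
#   project_mode_ok /projects/alice/demo || echo "unexpected mode"
#   getfacl /projects/alice/demo    # lists the ACL entries for team members
```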
Files and subdirectories in the project directory have the same permissions as the project directory, except:
- The public folder and everything in it are world readable.
- Any files hardlinked into the root anaconda environment (/opt/wakari/anaconda) remain owned by the root or wakari users.
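Project file and directory permissions are maintained by the start-project script. All files and directories in the project have their permissions set when the project is started, except for files owned by root or by the AEN_SRVC_ACCT user (usually wakari or aen_admin). Files owned by root or the AEN_SRVC_ACCT user are left unchanged, so that the permissions of files linked from /opt/wakari/anaconda are not altered.
Note: Do not start projects as the AEN_SRVC_ACCT user (usually wakari or aen_admin). The permission system cannot correctly manage project files owned by that user.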
General troubleshooting steps¶
Ensure that Anaconda Enterprise Notebooks services are set to start at boot¶
(On all 3 components: server, gateway, and compute node)
chkconfig --list | grep wakari
If any are missing, you can add them:
chkconfig --add [wakari-server|wakari-gateway|wakari-compute]
The services can then be started safely with the restart command, as follows:
service wakari-server restart
service wakari-gateway restart
service wakari-compute restart
Run each of these commands on the appropriate node.
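To see at a glance which services still need chkconfig --add, the listing can be filtered for the components installed on a given node. This is a sketch only: missing_from_chkconfig is a hypothetical helper, and it checks whether a service is registered, not whether it is switched on for a particular runlevel.

```shell
# Hypothetical helper: read "chkconfig --list" output on stdin and print each
# service named in the arguments that does not appear in the listing.
missing_from_chkconfig() {
    listing=$(cat)
    for svc in "$@"; do
        printf '%s\n' "$listing" | grep -q "^$svc[[:space:]]" || printf '%s\n' "$svc"
    done
}

# Example on a node that hosts both the server and the gateway:
#   chkconfig --list | missing_from_chkconfig wakari-server wakari-gateway
```

Pass only the services that belong on the node being checked; empty output means all of them are registered.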
Ensure that all services are running¶
(See Normal operation above.)
# service wakari-server status
wk-server RUNNING pid 20758, uptime 5 days, 0:30:23
worker RUNNING pid 20757, uptime 5 days, 0:30:23
root@server # service nginx status
nginx (pid 26303) is running...
# service wakari-gateway status
wk-gateway RUNNING pid 1137, uptime 5 days, 1:59:28
# service wakari-compute status
wk-compute RUNNING pid 22050, uptime 3 days, 1:03:19
If any process is missing, restart it with the commands above.
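The status output for the wakari services has the process name in the first column and its state in the second, so processes in any state other than RUNNING can be picked out with a short filter. This is a sketch with a hypothetical helper name; it does not apply to the nginx status line, which uses a different format.

```shell
# Hypothetical helper: read "service wakari-... status" output on stdin and
# print the name and state of every process whose state is not RUNNING.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# Example:
#   service wakari-server status | not_running
```

Empty output means every listed process is RUNNING.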
Check for extra processes¶
Use ps -Hu wakari to get a complete list of the processes running under the wakari user account.
root@server # ps -Hu wakari
PID TTY TIME CMD
20756 ? 00:02:26 .supervisord
20757 ? 00:05:58 mtq-worker
20758 ? 00:00:08 wk-server
20765 ? 00:02:00 wk-server
20766 ? 00:01:55 wk-server
20767 ? 00:02:20 wk-server
20770 ? 00:02:02 wk-server
root@server # ps -f -C nginx
UID PID PPID C STIME TTY TIME CMD
root 26303 1 0 12:18 ? 00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 26305 26303 0 12:18 ? 00:00:00 nginx: worker process
root@gateway # ps -Hu wakari
PID TTY TIME CMD
1136 ? 00:01:59 .supervisord
1137 ? 00:00:02 wk-gateway
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:02:01 .supervisord
1152 ? 00:00:01 wk-compute
What is normal:
- The wk-server, wk-gateway, and wk-compute processes should have the PIDs reported by supervisorctl.
- The nginx master process should have the PID reported by service nginx status.
- If you have installed more than one Anaconda Enterprise Notebooks component on a single machine, the processes from all of the installed components will show up on that machine.
- On the Compute node, any Anaconda Enterprise Notebooks applications currently being run by users will be present. For example:
root@compute # ps -Hu wakari
PID TTY TIME CMD
1150 ? 00:00:00 .supervisord
1152 ? 00:00:00 wk-compute
1340 ? 00:00:00 bash
1341 ? 00:00:00 notebookwrapper
If there are extra wk-server, wk-gateway, wk-compute, or supervisord processes, remove them with the kill command. Then restart the service with service SERVICE_NAME restart, as described above.
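Counting how often a command name appears in the ps listing is a quick way to spot strays. In the listings above there is one .supervisord per installed component, while several wk-server processes are normal. The helper name below is hypothetical, and it matches on the last column of the ps output only.

```shell
# Hypothetical helper: read "ps -Hu wakari" output on stdin and print how many
# lines have the given command name in the last (CMD) column.
count_cmd() {
    awk -v cmd="$1" '$NF == cmd { n++ } END { print n + 0 }'
}

# Example: expect 1 on a node with a single component installed.
#   ps -Hu wakari | count_cmd .supervisord
```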
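Check connectivity between servers¶
Server to Gateway¶
On the server, navigate to Admin/Data Centers. For each data center in the list, check connectivity from the server to that gateway (in this example, the gateway is http://gateway.example.com:8089):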
root@server # curl --connect-timeout 5 http://gateway.example.com:8089 > /dev/null
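Gateway to Compute node¶
On the server, navigate to Admin/Enterprise Resources. For each compute resource in the list, open it and check the contents of the URL field to make sure that it begins with "http" or "https". From the corresponding gateway, check connectivity to that URL. For example, if the URL is http://compute.example.com:5002: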
root@gateway # curl --connect-timeout 5 http://compute.example.com:5002 > /dev/null
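The URL prefix check and the curl test can be combined so that a malformed URL is reported before any connection is attempted. This is a sketch under stated assumptions: both function names are hypothetical, and the return code 2 for a bad URL is an arbitrary choice made here.

```shell
# Hypothetical helper: succeed only if the URL starts with http:// or https://
# and has something after the scheme.
valid_resource_url() {
    case $1 in
        http://?*|https://?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: validate the URL, then try to connect with the same
# 5 second timeout used above. Returns 2 for a malformed URL, otherwise
# curl's own exit status.
check_url() {
    valid_resource_url "$1" || { echo "URL must start with http:// or https://: $1" >&2; return 2; }
    curl --silent --show-error --connect-timeout 5 --output /dev/null "$1"
}

# Example, run on the gateway:
#   check_url http://compute.example.com:5002 && echo reachable
```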
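Gateway to Server¶
This path is used by the gateway configuration command wk-gateway-configure. First, make sure that the gateway is linked to the correct server in the configuration file and that the full server URL is specified. Then check connectivity to the server.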
root@gateway # grep WAKARI_SERVER /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
"WAKARI_SERVER": "http://wakari.example.com",
root@gateway # curl --connect-timeout 5 http://wakari.example.com > /dev/null
root@gateway # curl --connect-timeout 5 http://error.example.com > /dev/null
curl: (7) Failed to connect to error.example.com port 80: Connection refused
If a connection fails, check the following:
- Ensure that Gateways (Data Centers) and Compute nodes (Enterprise Resources) are correctly configured on the server.
- Verify that processes are listening on the configured ports:
root@server # netstat -plt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:http *:* LISTEN 26409/nginx
tcp 0 0 *:ssh *:* LISTEN 986/sshd
tcp 0 0 localhost:smtp *:* LISTEN 1063/master
tcp 0 0 *:complex-main *:* LISTEN 26192/python
tcp 0 0 localhost:27017 *:* LISTEN 29261/mongod
tcp 0 0 *:ssh *:* LISTEN 986/sshd
tcp 0 0 localhost:smtp *:* LISTEN 1063/master
- Check firewall settings/logs on both hosts to ensure that packets are not being blocked or discarded.
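The netstat listing above shows service names such as complex-main in place of port numbers. Adding the -n flag prints numeric ports, which are easier to compare with the ports in the tables under Normal operation. The following sketch filters that numeric output for one port; port_listening is a hypothetical helper name.

```shell
# Hypothetical helper: read "netstat -plnt" output on stdin and succeed if some
# socket in the LISTEN state is bound to the given port on any address.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" { n = split($4, a, ":"); if (a[n] == port) found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Examples, using the ports listed under Normal operation:
#   netstat -plnt | port_listening 5000 || echo "wk-server is not listening"
#   netstat -plnt | port_listening 8089 || echo "wk-gateway is not listening"
```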
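Check configuration file syntax¶
Use these commands to verify that the configuration files contain valid JSON. json.tool reads one input file per invocation and treats a second file argument as the output file, so each file is checked separately:
root@server # for f in /opt/wakari/wakari-server/etc/wakari/*.json; do python -m json.tool "$f"; done
root@gateway # for f in /opt/wakari/wakari-gateway/etc/wakari/*.json; do python -m json.tool "$f"; done
root@compute # for f in /opt/wakari/wakari-compute/etc/wakari/*.json; do python -m json.tool "$f"; done
If a file is correct, its contents are displayed. If a file contains a syntax error, the message No JSON object could be decoded is displayed. Edit the configuration file and make sure that the JSON syntax is correct.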
Check file ownership¶
Verify that all files in /opt/wakari/anaconda belong to the user and group wakari:
root@server # find /opt/wakari/anaconda \! -user wakari -print
root@server # find /opt/wakari/anaconda \! -group wakari -print
If any files are listed in the output, fix their ownership:
chown -R wakari:wakari /opt/wakari/anaconda
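The two find commands can be merged into a single pass, which is handy for confirming that nothing is left over after the chown. This is only a sketch, and the helper name is hypothetical.

```shell
# Hypothetical helper: print every path under a directory whose owner or group
# differs from the ones given. Names or numeric IDs are both accepted by find.
not_owned_by() {
    find "$1" \( \! -user "$2" -o \! -group "$3" \) -print
}

# Example; empty output means ownership is consistent:
#   not_owned_by /opt/wakari/anaconda wakari wakari
```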
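Verify that POSIX ACLs are enabled¶
The acl option must be enabled on the file system that contains the project root directory.
First, determine the project root directory. If a custom projectRoot is configured, you can find it with: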
root@compute # grep projectRoot /opt/wakari/wakari-compute/etc/wakari/config.json
Otherwise, the project root is /projects.
Either the mount options or the default mount options listed by tune2fs should indicate that the acl option is enabled.
root@compute # fs=`df /projects | tail -1 | cut -d " " -f 1`
root@compute # mount | grep $fs
/dev/vda on / type ext4 (rw)
root@compute # tune2fs -l $fs | grep options
Default mount options: user_xattr acl
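In the sample output above, acl is absent from the mount options but present in the tune2fs defaults, which is enough. The decision can be sketched as a small function that takes both option strings. acl_enabled is a hypothetical helper, and it assumes that an explicit noacl at mount time overrides the file system default.

```shell
# Hypothetical helper: decide whether ACLs are enabled.
#   $1: mount options as shown in parentheses by mount, for example "rw,acl"
#   $2: value of the "Default mount options:" line from tune2fs -l
acl_enabled() {
    mount_opts=$1
    default_opts=$2
    # Assumption: an explicit noacl at mount time overrides the default.
    if printf '%s\n' "$mount_opts" | tr ',' '\n' | grep -qx noacl; then
        return 1
    fi
    printf '%s\n' "$mount_opts" | tr ',' '\n' | grep -qx acl && return 0
    printf '%s\n' "$default_opts" | tr ' ' '\n' | grep -qx acl
}

# With the values from the sample output above:
#   acl_enabled "rw" "user_xattr acl" && echo "ACLs enabled"
```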
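Clear browser cookies¶
When the Anaconda Enterprise Notebooks configuration is changed or the software is upgraded, cookies remaining in the browser can cause problems. Clearing the cookies and logging in again can help resolve them.
Specific problems¶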
Problem | Cause | Solution |
---|---|---|
Browser indicates “too many redirects” | Cookies are out of date | Clear your browser’s cookies and cache, then try again. |
supervisorctl error: “unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file” | “supervisord” is not running on the Server | Ensure that supervisord is included in the crontab, as described above. Then start supervisord manually. |
Data Center Not Found message when deleting a project | Datacenter has already been removed | As root, run /opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project> |
Forgotten administrator password | | Use ssh to log in to the server as root, and run the command /opt/wakari/wakari-server/bin/wk-server-admin add-user wakari --admin -p <new password> -e <your email> . You can then log in to Anaconda Enterprise Notebooks as the wakari user with the new password you chose. |
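Logs¶
The locations of the Anaconda Enterprise Notebooks log files for each process and application are shown in the tables above.
The Anaconda Enterprise Notebooks installer logs to /tmp/wakari_{server,gateway,compute}.log.
If the log files grow too large, they can be deleted. To make the logs more or less verbose, the Jupyter Notebook system has a setting, 'Application.log_level'. Setting 'Application.log_level' to 'ERROR' makes the logs terser than the default while keeping them reasonably informative.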
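supervisord was killed and "Error: This socket is closed." appears¶
When the supervisor daemon "supervisord" is killed, information sent to standard output (stdout) and standard error (stderr) is held in a pipe that eventually fills up. Any attempt to start an application then fails with the error message "This socket is closed".
To prevent this problem, always shut down and restart processes cleanly, and do not shut down or kill supervisord without first shutting down wk-compute and the other processes that use it.
To recover from this problem, use sudo kill -9 to shut down the "wk-compute" process. Then restart the supervisord and wk-compute processes: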
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
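Service Error 502: Cannot connect to the application manager¶
When the gateway node displays this error, the compute resource is not responding.
This error occurs when the "wk-compute" process has been shut down. To recover from this problem, restart the supervisord and wk-compute processes: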
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
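"502 Communication Error" on Amazon Web Services¶
If you see a page that says "502 Communication Error: This gateway could not communicate with the Wakari server" together with the IP address of the Wakari server, configure the AEN gateway to use the server's DNS hostname. On Amazon Web Services (AWS), this is the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.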