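Track experiments with aim remote server
Overview
The Aim remote tracking server lets you run experiments in a multi-host environment and collect the tracked data in one central location. It provides an SDK for client-server communication and uses the http/ws protocols as its core transport layer.
This guide shows how to set up the Aim remote tracking server and how to integrate it into client-side code.
Prerequisites
The remote tracking server is intended for multi-host environments that run multiple training experiments. The machine running the server must accept incoming HTTP traffic on a dedicated port (53800 by default).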
Server-side setup
Make sure aim 3.4.0 or later is installed:
$ pip install "aim>=3.4.0"
Initialize the aim repository (optional):
$ aim init
Run the aim server with a dedicated aim repository:
$ aim server --repo <REPO_PATH>
You will see the following output:
> Server is mounted on 0.0.0.0:53800
> Press Ctrl+C to exit
The server is up and ready to receive tracked data.
Run the aim UI:
$ aim up --repo <REPO_PATH>
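Before moving on to the client, it can be useful to confirm that the server port is reachable from the machine that will run the training code. The following is a minimal sketch using only the Python standard library; `is_port_open` is a helper name introduced here for illustration and is not part of aim. It only checks that a TCP connection can be opened, not that an aim server is the process answering.

```python
import socket


def is_port_open(host: str, port: int = 53800, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Example call (replace the example IP with your tracking server IP/hostname):
# is_port_open("172.3.66.145", 53800)
```

If the check fails, the usual suspects are a firewall rule on the server machine or a server started on a different port.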
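Client-side setup
With the current architecture, using the aim SDK requires almost no changes. The only difference from local tracking is that you provide the remote tracking URL instead of a local aim repository path. The following code shows how to create a Run object with a remote tracking URL and how to use it.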
from aim import Run
aim_run = Run(repo='aim://172.3.66.145:53800') # replace example IP with your tracking server IP/hostname
# Log run parameters
aim_run['params'] = {
    'learning_rate': 0.001,
    'batch_size': 32,
}
...
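When the server address differs between environments, the remote tracking URL can be assembled instead of hard-coded. The sketch below is one possible approach: `remote_repo_url` and the environment variable names `TRACKING_SERVER_HOST` and `TRACKING_SERVER_PORT` are hypothetical names chosen for this example, not settings that aim reads.

```python
import os

DEFAULT_AIM_PORT = 53800  # default port shown in the server output


def remote_repo_url(host: str, port: int = DEFAULT_AIM_PORT) -> str:
    """Build an aim:// remote tracking URL for the given host and port."""
    if not host:
        raise ValueError("host must be a non-empty string")
    return f"aim://{host}:{port}"


# TRACKING_SERVER_HOST / TRACKING_SERVER_PORT are hypothetical variable names
# used only in this sketch; fall back to the example address from this guide.
host = os.environ.get("TRACKING_SERVER_HOST", "172.3.66.145")
port = int(os.environ.get("TRACKING_SERVER_PORT", DEFAULT_AIM_PORT))
repo_url = remote_repo_url(host, port)

# The resulting string is passed to Run exactly as in the snippet above:
# aim_run = Run(repo=repo_url)
```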
You can now use the aim_run object to track your experiment results. Below is a complete example of remote tracking with pytorch on the MNIST dataset.
from aim import Run
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
# Initialize a new Run with remote tracking URL
aim_run = Run(repo='aim://172.3.66.145:53800') # replace example IP with your tracking server IP/hostname
# Device configuration
device = torch.device('cpu')
# Hyper parameters
num_epochs = 5
num_classes = 10
batch_size = 16
learning_rate = 0.01
# aim - Track hyper parameters
aim_run['hparams'] = {
    'num_epochs': num_epochs,
    'num_classes': num_classes,
    'batch_size': batch_size,
    'learning_rate': learning_rate,
}
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)
test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False,
                                          transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * 32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

model = ConvNet(num_classes).to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i % 30 == 0:
            print('Epoch [{}/{}], Step [{}/{}], '
                  'Loss: {:.4f}'.format(epoch + 1, num_epochs, i + 1,
                                        total_step, loss.item()))
            # aim - Track model loss function
            aim_run.track(loss.item(), name='loss', epoch=epoch,
                          context={'subset': 'train'})
            correct = 0
            total = 0
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            acc = 100 * correct / total
            # aim - Track metrics
            aim_run.track(acc, name='accuracy', epoch=epoch, context={'subset': 'train'})
            if i % 300 == 0:
                aim_run.track(loss.item(), name='loss', epoch=epoch, context={'subset': 'val'})
                aim_run.track(acc, name='accuracy', epoch=epoch, context={'subset': 'val'})
# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Test Accuracy: {} %'.format(100 * correct / total))
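Because every track call in this script is sent to the remote server, it can help to estimate how many calls the example makes. The following is a rough count, assuming the standard 60,000-image MNIST training split and the hyperparameters used above; `logged_steps` is a helper introduced only for this calculation.

```python
import math

TRAIN_SAMPLES = 60_000  # size of the MNIST training split (assumed standard split)
BATCH_SIZE = 16         # batch_size from the example
NUM_EPOCHS = 5          # num_epochs from the example

# DataLoader yields ceil(samples / batch_size) batches per epoch
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)


def logged_steps(total_steps: int, every: int) -> int:
    """Count step indices i in [0, total_steps) with i % every == 0."""
    return math.ceil(total_steps / every)


train_points = logged_steps(steps_per_epoch, 30)   # 'train' context, every 30 steps
val_points = logged_steps(steps_per_epoch, 300)    # 'val' context, every 300 steps

# Two metrics (loss and accuracy) are tracked at each logged step
calls_per_epoch = 2 * train_points + 2 * val_points
total_calls = calls_per_epoch * NUM_EPOCHS

print(steps_per_epoch, train_points, val_points, total_calls)
# prints: 3750 125 13 1380
```

Raising or lowering the two logging intervals is the simplest way to trade metric resolution against the amount of data sent to the server.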
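SSL support
The Aim Remote Tracking server can be configured to run in secure mode with SSL certificates.
To enable secure mode for the server, pass the --ssl-keyfile and --ssl-certfile arguments to the aim server command,
where --ssl-keyfile is the path to the certificate's private key file and --ssl-certfile is the path to the signed certificate
(see the Aim CLI for more).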
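As an illustration, the server command with the two SSL arguments can be assembled from a launcher script. This is a minimal sketch: `build_server_command` is a hypothetical helper, the paths are placeholders, and the rule that both files must be given together is an assumption of this example rather than documented aim behavior. Only the flags named above are used.

```python
import shlex
from typing import List, Optional


def build_server_command(repo: str,
                         ssl_keyfile: Optional[str] = None,
                         ssl_certfile: Optional[str] = None) -> List[str]:
    """Return the argv list for `aim server`, optionally in secure mode."""
    # Assumption of this sketch: the key and certificate are supplied as a pair
    if (ssl_keyfile is None) != (ssl_certfile is None):
        raise ValueError("provide both ssl_keyfile and ssl_certfile, or neither")
    cmd = ["aim", "server", "--repo", repo]
    if ssl_keyfile is not None:
        cmd += ["--ssl-keyfile", ssl_keyfile, "--ssl-certfile", ssl_certfile]
    return cmd


# Placeholder paths; replace with your repository, key, and certificate
cmd = build_server_command("/path/to/repo", "server.key", "server.crt")
print(shlex.join(cmd))
# To launch the server from Python: subprocess.run(cmd, check=True)
```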
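If you are using a self-signed certificate, the client must be configured accordingly; otherwise, the client automatically detects whether the server supports secure connections. Set the __AIM_CLIENT_SSL_CERTIFICATES_FILE__ environment variable to the path of a file containing the PEM-encoded root certificates, for example: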
export __AIM_CLIENT_SSL_CERTIFICATES_FILE__=/path/of/the/certs/file
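**Note:** So that only one file needs to be provided to the client, the private key is also read from the certificate file. The following example does this: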
# generate the cert and key files
openssl genrsa -out server.key 2048
openssl req -new -x509 -sha256 -key server.key -out server.crt -days 3650 -subj '/CN={DOMAIN_NAME}'
# append the private key to the certs in a new file
cat server.crt server.key > server.includesprivatekey.pem
# set the env variable for aim client
export __AIM_CLIENT_SSL_CERTIFICATES_FILE__=./server.includesprivatekey.pem
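The same variable can also be set from Python, presumably before the Run object is created. The sketch below first checks that the combined file really contains both a certificate and a private key; `pem_has_cert_and_key` and `configure_client_certificates` are hypothetical helpers written for this example, and the check only looks at PEM block markers, not at whether the certificate is valid.

```python
import os
import re

ENV_VAR = "__AIM_CLIENT_SSL_CERTIFICATES_FILE__"

# Matches "PRIVATE KEY", "RSA PRIVATE KEY", "EC PRIVATE KEY", and similar headers
_KEY_MARKER = re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----")
_CERT_MARKER = "-----BEGIN CERTIFICATE-----"


def pem_has_cert_and_key(path: str) -> bool:
    """Return True if the file has both a certificate and a private key block."""
    with open(path, "r", encoding="ascii", errors="replace") as f:
        text = f.read()
    return _CERT_MARKER in text and _KEY_MARKER.search(text) is not None


def configure_client_certificates(path: str) -> None:
    """Point the aim client at the combined PEM file via the environment."""
    if not pem_has_cert_and_key(path):
        raise ValueError(f"{path} must contain a certificate and a private key")
    os.environ[ENV_VAR] = os.path.abspath(path)


# Example call, using the file produced by the shell commands above:
# configure_client_certificates("./server.includesprivatekey.pem")
```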
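Conclusion
As you can see, the aim remote tracking server lets you run experiments on multiple hosts with a simple setup and minimal changes to the training code.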