PyTorch Lightning: Common Issues and Fixes
0 Environment
PyTorch Lightning 1.1.7
1 Error: object has no attribute '_metrics_to_agg'
When using a custom Logger, a very likely cause of this error is forgetting to call super().__init__() inside the custom __init__. A correct custom logger looks like this:
from pytorch_lightning.loggers import LightningLoggerBase
from pytorch_lightning.loggers.base import rank_zero_experiment
from pytorch_lightning.utilities import rank_zero_only


class MyLogger(LightningLoggerBase):
    def __init__(self, a):
        # Don't forget to call this
        super().__init__()
        self.a = a

    @property
    def name(self):
        return 'MyLogger'

    @property
    @rank_zero_experiment
    def experiment(self):
        # Return the experiment object associated with this logger.
        pass

    @property
    def version(self):
        # Return the experiment version, int or str.
        return '0.1'

    @rank_zero_only
    def log_hyperparams(self, params):
        # params is an argparse.Namespace
        # your code to record hyperparameters goes here
        pass

    @rank_zero_only
    def log_metrics(self, metrics, step):
        # metrics is a dictionary of metric names and values
        # your code to record metrics goes here
        pass

    @rank_zero_only
    def save(self):
        # Optional. Any code necessary to save logger data goes here
        # If you implement this, remember to call `super().save()`
        # at the start of the method (important for aggregation of metrics)
        super().save()

    @rank_zero_only
    def finalize(self, status):
        # Optional. Any code that needs to be run after training
        # finishes goes here
        pass
2 Error: RuntimeError: grad can be implicitly created only for scalar outputs
In DataParallel (dp) mode the loss is returned by each training_step separately: on a single GPU it is a scalar, but with multiple GPUs the gathered result is a tensor with one entry per GPU, while backward() requires the loss to be a scalar.
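The underlying restriction can be reproduced with plain PyTorch, independent of Lightning: backward() can only be called implicitly on a scalar tensor, which is why the per-GPU losses have to be reduced first. A minimal sketch:

import torch

x = torch.randn(4, requires_grad=True)
loss_vec = x * 2.0              # non-scalar "loss", like the gathered per-GPU losses in dp mode
# loss_vec.backward()           # raises: grad can be implicitly created only for scalar outputs
loss_vec.mean().backward()      # reducing to a scalar first works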
To fix this, add a training_step_end function that reduces the loss with mean():
def training_step_end(self, outputs):
    if outputs is None:
        return None
    if outputs['loss'] is None:
        return None
    return {'epoch': self.current_epoch,
            'loss': outputs['loss'].mean()}
Note that the function above assumes training_step returns results in the following form; if yours differs, adjust it accordingly:
return {'loss': loss}
Also note that by default PyTorch Lightning accepts either the loss tensor returned directly or a dict of the form {'loss': loss}; other return formats may require further changes.
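For reference, a minimal sketch of the two accepted return forms (compute_loss is a hypothetical helper, not part of the original post):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    return loss                      # form 1: return the loss tensor directly
    # return {'loss': loss}          # form 2: return it inside a dict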
3 Issue: thread safety of training_step / validation_step / test_step in distributed training
Some PyTorch Lightning examples do matplotlib plotting inside training_step. That is fine in a single-threaded run, but in parallel modes such as dp or ddp, training_step, validation_step and test_step effectively run in parallel, and since matplotlib is not thread-safe, deadlocks or crashes can occur.
So when using distributed training, make sure the code inside these functions is thread-safe.
The official documentation helps to understand this better:
https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html#methods
For single-device training, the loop looks roughly like this:
outs = []
for batch in train_dataloader:
    # forward
    out = training_step(batch)
    # backward
    loss.backward()
    # apply and clear grads
    optimizer.step()
    optimizer.zero_grad()

training_epoch_end(outs)
For distributed training, the loop looks roughly like this:
outs = []
for train_batch in train_dataloader:
    batches = split_batch(train_batch)
    dp_outs = []
    for sub_batch in batches:
        # 1
        dp_out = training_step(sub_batch)
        dp_outs.append(dp_out)

    # 2
    out = training_step_end(dp_outs)
    outs.append(out)

# do something with the outputs for all batches
# 3
training_epoch_end(outs)
One workable fix is to protect this code with a thread lock, for example:
Add the following to the LightningModule's __init__:
import threading  # at the top of the file

self.mutex = threading.Lock()
and wrap the non-thread-safe code in training_step and similar hooks with:
with self.mutex:
    # Your drawing code here
    ...
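Putting the two pieces together, a minimal sketch of a LightningModule that serializes its matplotlib calls could look like the following (compute_loss, the figure contents, and the output file name are placeholders, not from the original post):

import threading

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safer for training jobs
import matplotlib.pyplot as plt
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.mutex = threading.Lock()

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # matplotlib is not thread-safe: serialize all drawing across dp workers
        with self.mutex:
            fig, ax = plt.subplots()
            ax.set_title(f'step {self.global_step}')
            fig.savefig(f'debug_step_{self.global_step}.png')
            plt.close(fig)
        return loss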
4 Tip: use @rank_zero_only to mark functions that should only run on RANK 0
In distributed training, if some logging or evaluation should only run on RANK 0, move that code into a function and decorate it with @rank_zero_only, for example:
@rank_zero_only
def do_evaluate_matches(self):
    ...
Note the required import:
from pytorch_lightning.utilities import rank_zero_only
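As a usage sketch (do_evaluate_matches is just the illustrative name from the snippet above), the decorated method can be called from every rank; on ranks other than 0 the call simply does nothing:

import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_only


class MyModel(pl.LightningModule):
    @rank_zero_only
    def do_evaluate_matches(self):
        # heavy evaluation / logging that should run exactly once
        print('running evaluation on rank 0 only')

    def validation_epoch_end(self, outputs):
        # safe to call unconditionally: a no-op on non-zero ranks
        self.do_evaluate_matches()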
5 Error: AttributeError: Missing attribute "training_step_output_for_epoch_end"
When training with DDP or another distributed mode, if the current batch is None, the following error may appear:
Traceback (most recent call last):
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/utilities/parsing.py", line 183, in __getattr__
    return self[key]
KeyError: 'training_step_output_for_epoch_end'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/Develop/SuperGlue/train_pl.py", line 165, in <module>
    main()
  File "/data/Develop/SuperGlue/train_pl.py", line 161, in main
    trainer.fit(superglue_model, match_datamodule)
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 472, in fit
    results = self.accelerator_backend.train()
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 307, in ddp_train
    results = self.train_or_test()
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 523, in train
    self.train_loop.run_training_epoch()
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 582, in run_training_epoch
    batch_output.training_step_output_for_epoch_end,
  File "/home/liuxiao/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/pytorch_lightning/utilities/parsing.py", line 185, in __getattr__
    raise AttributeError(f'Missing attribute "{key}"') from exp
AttributeError: Missing attribute "training_step_output_for_epoch_end"
Comparing against the library code, a likely cause is a logic gap in the handling of batch = None. In trainer/training_loop.py:
# ------------------------------------
# TRAINING_STEP + TRAINING_STEP_END
# ------------------------------------
with self.trainer.profiler.profile("run_training_batch"):
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)

# when returning -1 from train_step, we end epoch early
if batch_output.signal == -1:
    break

# only track outputs when user implements training_epoch_end
# otherwise we will build up unnecessary memory
epoch_end_outputs = self.process_train_step_outputs(
    batch_output.training_step_output_for_epoch_end,
    self.early_stopping_accumulator,
    self.checkpoint_accumulator,
)
while in the run_training_batch function:
def run_training_batch(self, batch, batch_idx, dataloader_idx):
    # track grad norms
    grad_norm_dic = {}

    # bookkeeping
    using_results_obj = False
    self.trainer.hiddens = None

    # track all outputs across time and num of optimizers
    batch_outputs = [[] for _ in range(len(self.get_optimizers_iterable()))]

    if batch is None:
        return AttributeDict(signal=0, grad_norm_dic=grad_norm_dic)

    # hook
    response = self.trainer.call_hook("on_batch_start")
    if response == -1:
        return AttributeDict(signal=-1, grad_norm_dic=grad_norm_dic)
In other words, when batch is None a batch_output carrying only signal=0 is returned, and that object has no training_step_output_for_epoch_end attribute, hence the error.
It is not clear in which release this will be fixed; an interim workaround is to never return a None batch, but instead return something like:
if len(batch) == 0:
    return {
        'valid_batch': False
    }
and then add the following handling in training_step:
def training_step(self, data, batch_idx):
    if data is None:
        return None
    # skip the placeholder batches produced for empty inputs
    if isinstance(data, dict) and data.get('valid_batch', True) is False:
        return None
    # ... normal training code ...
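One place the placeholder above can live is a custom collate_fn. The sketch below assumes the dataset yields dict samples and returns None for broken ones; my_collate_fn and the filtering logic are illustrative, not from the original post:

from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate


def my_collate_fn(samples):
    # drop samples the dataset marked as broken by returning None
    samples = [s for s in samples if s is not None]
    if len(samples) == 0:
        # never hand a None batch to the training loop
        return {'valid_batch': False}
    batch = default_collate(samples)
    batch['valid_batch'] = True
    return batch


# hypothetical usage, assuming train_dataset exists:
# train_loader = DataLoader(train_dataset, batch_size=8, collate_fn=my_collate_fn)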
6 Using ReduceLROnPlateau
With PyTorch Lightning's default automatic_optimization = True mode, configure it like this:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    lr_scheduler = {
        'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.1, patience=5),
        'monitor': 'val_loss',
    }
    return {'optimizer': optimizer, 'lr_scheduler': lr_scheduler}
Unlike most other schedulers, ReduceLROnPlateau requires the monitor key to be configured; without it you get the following error:

configure_optimizers must include a monitor when a ReduceLROnPlateau scheduler is used.
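Note also that the monitored key must actually be logged somewhere, otherwise ReduceLROnPlateau has nothing to react to. A minimal sketch of logging val_loss in validation_step (compute_loss is a hypothetical helper):

def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # this is the key referenced by 'monitor' in configure_optimizers
    self.log('val_loss', loss)
    return loss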
7 Using manual optimization to control the optimization steps yourself
If you need finer-grained control, you can disable PyTorch Lightning's automatic_optimization (set it to False) and run the optimization manually, roughly as follows.
For example, driving the optimizer directly:
from pytorch_lightning import LightningModule


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # Important: This property activates manual optimization.
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()

        opt.zero_grad()
        loss = self.compute_loss(batch)
        self.manual_backward(loss)
        opt.step()
Or stepping the lr_scheduler by hand:
# step every batch
def __init__(self):
    super().__init__()
    self.automatic_optimization = False


def training_step(self, batch, batch_idx):
    # do forward, backward, and optimization
    ...

    # single scheduler
    sch = self.lr_schedulers()
    sch.step()

    # multiple schedulers
    sch1, sch2 = self.lr_schedulers()
    sch1.step()
    sch2.step()
References:
[1] https://pytorch-lightning.readthedocs.io/en/latest/common/optimizers.html#manual-optimization
[2] https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau
[3] https://pytorch-lightning.readthedocs.io/en/latest/common/optimizers.html#learning-rate-scheduling-manual
8 Using LearningRateMonitor to track learning rate changes
To monitor learning rate changes with LearningRateMonitor, create one and pass it to the Trainer as a callback:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor

lr_monitor = LearningRateMonitor(logging_interval='step')

trainer = pl.Trainer(gpus=ngpus_per_node,
                     callbacks=[..., lr_monitor],
                     ...)
To read the learning rate back, you can pick it up inside Logger.log_metrics(self, metrics, step); the reported metric key usually contains 'lr', e.g. something like lr-Adam:
import re

for key in metrics:
    # Log Learning Rate
    if re.search('lr', key) is not None:
        lr = metrics[key]
9 Getting the current epoch and step inside a LightningModule
Get the current epoch:
curr_epoch = self.current_epoch
Get the current step:
curr_step = self.global_step
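Both attributes can be used directly inside any LightningModule hook, for example to print progress every N steps (a small illustrative sketch; compute_loss is a hypothetical helper):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    if self.global_step % 100 == 0:
        print(f'epoch={self.current_epoch} step={self.global_step} loss={loss.item():.4f}')
    return loss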