Releases: Lightning-AI/pytorch-lightning
DDP bug fixes
We had a few (subtle) bugs in 0.7.2 that affected DDP and a few other key things, so we released 0.7.3 to fix them because they are critical for DDP. Sorry about that! There are still no API changes, but please skip straight to 0.7.3 to pick up those fixes.
Detail changes
Added
- Added `rank_zero_warn` for warning only in rank 0 (#1428)
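As a quick illustration, `rank_zero_warn` emits a warning only on the process with global rank 0, so DDP runs are not flooded with duplicate messages. A minimal sketch, assuming the import path below (it may differ between versions):

```python
# hedged sketch: the exact import path may vary across Lightning versions
from pytorch_lightning.utilities import rank_zero_warn

def check_dataloader_workers(num_workers: int) -> None:
    if num_workers < 2:
        # printed once (rank 0) instead of once per DDP process
        rank_zero_warn(f"num_workers={num_workers} may bottleneck data loading.")
```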
Fixed
- Fixed default `DistributedSampler` for DDP training (#1425)
- Fixed workers warning not on Windows (#1430)
- Fixed returning tuple from `run_training_batch` (#1431)
- Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)
Contributors
Many bug fixes, added flexibility, parity tests with pytorch and more
Overview
This release focuses on fixing particular issues and improving the user development experience by extending the docs, adding typing, and supporting Python 3.8. In particular, some of the release highlights are:
- Added benchmarks comparing Lightning with vanilla implementations
- Extended optimizer support with per-optimizer frequencies
- Several logger improvements, such as representing non-primitive types and supporting hierarchical dictionaries for hyperparameter searches
- Added model configuration checking before training runs
- Simplified the PL examples structure (shallower and more readable)
- Improved Trainer CLI arguments handling (generalization)
- Two Trainer arguments became deprecated: `print_nan_grads` and `show_progress_bar`
Detail changes
Added
- Added same step loggers' metrics aggregation (#1278)
- Added parity test between a vanilla MNIST model and lightning model (#1284)
- Added parity test between a vanilla RNN model and lightning model (#1351)
- Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
- Added support for hierarchical `dict` (#1152)
- Added `TrainsLogger` class (#1122)
- Added type hints to `pytorch_lightning.core` (#946)
- Added support for `IterableDataset` in validation and testing (#1104)
- Added support for non-primitive types in `hparams` for `TensorboardLogger` (#1130)
- Added a check that stops the training when loss or weights contain `NaN` or `inf` values (#1097)
- Added support for `IterableDataset` when `val_check_interval=1.0` (default); this will trigger validation at the end of each epoch (#1283)
- Added `summary` method to Profilers (#1259)
- Added informative errors if user defined dataloader has zero length (#1280)
- Added testing for Python 3.8 (#915)
- Added a `training_epoch_end` method which is the mirror of `validation_epoch_end` (#1357)
- Added model configuration checking (#1199)
- Added support for optimizer frequencies through `LightningModule.configure_optimizers()` (#1269), see the sketch below
- Added option to run without an optimizer by returning `None` from `configure_optimizers` (#1279)
- Added a warning when the number of data loader workers is small (#1378)
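To illustrate the optimizer-frequency support (#1269): `configure_optimizers` can return a list of dictionaries so that each optimizer runs for a given number of consecutive batches (handy for GAN-style alternating updates). A hedged sketch, where the `generator`/`discriminator` attributes are purely illustrative:

```python
import torch

# inside a LightningModule; `generator` and `discriminator` are illustrative submodules
def configure_optimizers(self):
    gen_opt = torch.optim.Adam(self.generator.parameters(), lr=1e-4)
    dis_opt = torch.optim.Adam(self.discriminator.parameters(), lr=4e-4)
    # each dict pairs an optimizer with how many consecutive batches it should handle
    return [
        {'optimizer': gen_opt, 'frequency': 1},
        {'optimizer': dis_opt, 'frequency': 5},
    ]
```

Returning `None` instead (per #1279) runs the training loop without any optimizer at all.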
Changed
- Changed (renamed and refactored) `TensorRunningMean` -> `TensorRunningAccum`: running accumulations were generalized (#1278)
- Changed `progress_bar_refresh_rate` trainer flag to disable the progress bar when set to 0 (#1108)
- Enhanced `load_from_checkpoint` to also forward params to the model (#1307), see the sketch below
- Updated references to `self.forward()` to instead use the `__call__` interface (#1211)
- Changed default behaviour of `configure_optimizers` to use no optimizer rather than Adam (#1279)
- Allow uploading models on W&B (#1339)
- On DP and DDP2, unsqueeze is automated now (#1319)
- Do not always create a new DataLoader during reinstantiation, but reuse the same type as before (if a subclass of DataLoader) (#1346)
- Do not interfere with a default sampler (#1318)
- Removed default Adam optimizer (#1317)
- Give warnings for unimplemented required lightning methods (#1317)
- Made `evaluate` method private >> `Trainer._evaluate(...)` (#1260)
- Simplified the PL examples structure (shallower and more readable) (#1247)
- Changed min-max GPU memory to be on their own plots (#1358)
- Removed `.item` which causes sync issues (#1254)
- Changed smoothing in TQDM to decrease variability of time remaining between training/eval (#1194)
- Changed default logger to a dedicated one (#1064)
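The `load_from_checkpoint` enhancement (#1307) means extra constructor arguments can be supplied at load time and are forwarded to the model, alongside whatever is restored from the checkpoint. A hedged sketch, with `LitClassifier` and its `pretrained_encoder` argument purely illustrative and the exact signature possibly differing between versions:

```python
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):  # illustrative module
    def __init__(self, hparams, pretrained_encoder=None):
        super().__init__()
        self.hparams = hparams
        self.encoder = pretrained_encoder

# keyword args given here are forwarded to LitClassifier.__init__
# in addition to the hparams/weights restored from the checkpoint
model = LitClassifier.load_from_checkpoint(
    'path/to/model.ckpt',
    pretrained_encoder=None,  # illustrative extra constructor argument
)
```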
Deprecated
- Deprecated Trainer argument `print_nan_grads` (#1097)
- Deprecated Trainer argument `show_progress_bar` (#1108)
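A hedged migration sketch for the two deprecated flags: the progress bar is now controlled via `progress_bar_refresh_rate` (0 disables it, per #1108), and the built-in NaN/inf check from #1097 supersedes `print_nan_grads`:

```python
import pytorch_lightning as pl

# before (deprecated flags):
# trainer = pl.Trainer(show_progress_bar=False, print_nan_grads=True)

# after: 0 disables the progress bar; NaN/inf detection is handled by the
# built-in check added in #1097, so no extra flag is needed here
trainer = pl.Trainer(progress_bar_refresh_rate=0)
```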
Removed
- Removed duplicated module `pytorch_lightning.utilities.arg_parse` for loading CLI arguments (#1167)
- Removed wandb logger's `finalize` method (#1193)
- Dropped `torchvision` dependency in tests and added own MNIST dataset class instead (#986)
Fixed
- Fixed `model_checkpoint` when saving all models (#1359)
- `Trainer.add_argparse_args` classmethod fixed; now it adds a type for the arguments (#1147)
- Fixed bug related to type checking of `ReduceLROnPlateau` lr schedulers (#1114)
- Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
- Fixed a bug that created an extra dataloader with active `reload_dataloaders_every_epoch` (#1181)
- Fixed all warnings and errors in the docs build process (#1191)
- Fixed an issue where `val_percent_check=0` would not disable validation (#1251)
- Fixed average of incomplete `TensorRunningMean` (#1309)
- Fixed `WandbLogger.watch` with `wandb.init()` (#1311)
- Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235)
- Fixed a bug that would cause `trainer.test()` to run on the validation set when overloading `validation_epoch_end` and `test_end` (#1353)
- Fixed `WandbLogger.watch` - use of the watch method without importing `wandb` (#1311)
- Fixed `WandbLogger` to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
- Made `training_epoch_end` behave like `validation_epoch_end` (#1357)
- Fixed `fast_dev_run` running validation twice (#1365)
- Fixed pickle error from quick patch `__code__` (#1352)
- Fixed memory leak on GPU0 (#1094, #1349)
- Fixed checkpointing interval (#1272)
- Fixed validation and training loops running the partial dataset (#1192)
- Fixed running `on_validation_end` only on main process in DDP (#1125)
- Fixed `load_spawn_weights` only in proc rank 0 (#1385)
- Fixed `use_amp` issue (#1145)
- Fixed using deprecated `use_amp` attribute (#1145)
- Fixed Tensorboard logger error: the lightning_logs directory does not exist in multi-node DDP on nodes with rank != 0 (#1375)
- Fixed `Unimplemented backend XLA` error on TPU (#1387)
Contributors
@alexeykarnachev, @amoudgl, @areshytko, @asafmanor, @awaelchli, @bkkaggle, @bmartinn, @Borda, @borisdayma, @cmpute, @djbyrne, @ethanwharris, @gerardrbentley, @jbschiratti, @jeremyjordan, @justusschock, @monney, @mpariente, @pertschuk, @rmrao, @S-aiueo32, @shubhamagarwal92, @SkafteNicki, @sneiman, @tullie, @vanpelt, @williamFalcon, @xingzhaolee
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor deprecation fix
Minor bug fix with print issues and `data_loader` (#1080).
TPU support & profiling
Overview
This is the first joint release between pytorch-bearer and Lightning, here we come ...
This release adds support for training models on Tensor Processing Units (TPU). We can now train models on GPUs and TPUs by changing a single parameter in `Trainer` (see docs). We are also bringing the flexibility of Bearer into Lightning by allowing arbitrary user-defined callbacks (see docs).
We are also including a profiler that allows Lightning users to identify training bottlenecks (see docs).
This release also includes automatic sampler setup: depending on the selected backend, Lightning configures the sampler correctly (no need for user input).
The loggers have also been extended to support multiple concurrent loggers passed to `Trainer` as an iterable (see docs), and we added support for step-based learning rate scheduling.
Finally, lots of bug fixes (see below).
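A hedged sketch of those highlights in one place. The flag name `num_tpu_cores` reflects this release (it was later renamed `tpu_cores`), the callback class is illustrative, and hook signatures follow the 0.7.x docs so details may differ slightly:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback

class PrintingCallback(Callback):  # user-defined callback (illustrative)
    def on_train_start(self, trainer, pl_module):
        print('Training is starting')

# switch hardware by changing a single Trainer parameter
trainer = pl.Trainer(
    num_tpu_cores=8,                 # train on 8 TPU cores (this release's flag name)
    callbacks=[PrintingCallback()],  # arbitrary user-defined callbacks
    # profiler=True,                 # the built-in profiler (argument form may vary by version)
)
```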
Detail changes
Added
- Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
- Added `reload_dataloaders_every_epoch=False` flag for trainer. Some users require reloading data every epoch (#926)
- Added `progress_bar_refresh_rate=50` flag for trainer. The refresh rate on notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added `optimizer_idx` argument to `backward` hook (#733)
- Added `entity` argument to `WandbLogger` to be passed to `wandb.init` (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs: can now set `version` to a `str` to just save to that directory, and use `name=''` to prevent experiment-name directory (#804)
- Added option to specify `step` key when logging metrics (#808)
- Added `train_dataloader`, `val_dataloader` and `test_dataloader` arguments to `Trainer.fit()`, for alternative data parsing (#759)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks in multiple files (#849)
- Support for user-defined callbacks (#889, #950)
- Added support for multiple loggers to be passed to `Trainer` as an iterable (e.g. list, tuple, etc.) (#903), see the sketch below
- Added support for step-based learning rate scheduling (#941)
- Added support for logging `hparams` as `dict` (#1029)
- Checkpoint and early stopping now work without val. step (#1041)
- Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
- Added type hints for function arguments (#912)
- Added default `argparser` for `Trainer` (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in `Trainer` (#728)
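Two of those additions as a hedged sketch: multiple loggers passed as an iterable (#903) and step-based learning rate scheduling (#941). Logger credentials, directories, and scheduler parameters are illustrative:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger, CometLogger

# multiple concurrent loggers passed to Trainer as an iterable
trainer = pl.Trainer(logger=[
    TensorBoardLogger('tb_logs'),
    CometLogger(),  # assumes Comet credentials are configured in the environment
])

# inside a LightningModule: step-based LR scheduling
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000)
    # 'interval': 'step' updates the scheduler every training step instead of every epoch
    return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]
```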
Changed
- Changed default TQDM to use `tqdm.auto` for prettier outputs in IPython notebooks (#752)
- Changed `pytorch_lightning.logging` to `pytorch_lightning.loggers` (#767)
- Moved the default `tqdm_dict` definition from Trainer to `LightningModule`, so it can be overridden by the user (#749)
- Moved functionality of `LightningModule.load_from_metrics` into `LightningModule.load_from_checkpoint` (#995)
- Changed Checkpoint path parameter from `filepath` to `dirpath` (#1016)
- Froze model `hparams` as a `Namespace` property (#1029)
- Dropped `logging` config in package init (#1015)
- Renamed model steps (#1051), see the sketch below:
  - `training_end` >> `training_epoch_end`
  - `validation_end` >> `validation_epoch_end`
  - `test_end` >> `test_epoch_end`
- Refactored dataloading, supports infinite dataloader (#955)
- Create single file in `TensorBoardLogger` (#777)
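A hedged before/after sketch for the model-step renames from #1051 (the old names remain available but deprecated, see below); the `val_loss` key is illustrative:

```python
import torch

# before (deprecated):  def validation_end(self, outputs): ...
# after:
def validation_epoch_end(self, outputs):
    # `outputs` collects the dicts returned by validation_step across the epoch
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    return {'val_loss': avg_loss}
```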
Deprecated
- Deprecated `pytorch_lightning.logging` (#767)
- Deprecated `LightningModule.load_from_metrics` in favour of `LightningModule.load_from_checkpoint` (#995, #1079)
- Deprecated `@data_loader` decorator (#926)
- Deprecated model steps `training_end`, `validation_end` and `test_end` (#1051, #1056)
Removed
- Removed dependency on `pandas` (#736)
- Removed dependency on `torchvision` (#797)
- Removed dependency on `scikit-learn` (#801)
Fixed
- Fixed a bug where early stopping `on_end_epoch` would be called inconsistently when `check_val_every_n_epoch == 0` (#743)
- Fixed a bug where the model checkpoint didn't write to the same directory as the logger (#771)
- Fixed a bug where the `TensorBoardLogger` class would create an additional empty log file during fitting (#777)
- Fixed a bug where `global_step` was advanced incorrectly when using `accumulate_grad_batches > 1` (#832)
- Fixed a bug when calling `self.logger.experiment` with multiple loggers (#1009)
- Fixed a bug when calling `logger.append_tags` on a `NeptuneLogger` with a single tag (#1009)
- Fixed sending back data from `.spawn` by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed comet logger to log after train (#892)
- Remove deprecated args to learning rate step function (#890)
Contributors
@airglow, @akshaykvnit, @AljoSt, @AntixK, @awaelchli, @baeseongsu, @bobkemp, @Borda, @calclavia, @Calysto, @djbyrne, @ethanwharris, @fdelrio89, @hadim, @hanbyul-kim, @jeremyjordan, @kuynzereb, @luiscape, @MattPainter01, @neggert, @onkyo14taro, @peteriz, @shoarora, @SkafteNicki, @smallzzy, @srush, @theevann, @tullie, @williamFalcon, @xeTaiz, @xssChauhan, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Simplifications & new docs
This release focused on a ton of bug fixes, small optimizations to training but most importantly, clean new docs!
Major changes
We have released new documentation; please bear with us as we fix broken links and patch in missing pieces.
The project moved to the new PyTorchLightning organization, so the repository root no longer sits under WilliamFalcon/PyTorchLightning.
We have added our own custom TensorBoard logger as the default logger.
We have upgraded Continuous Integration to speed up automatic testing.
We have fixed GAN training by supporting multiple optimizers.
Complete changelog
Added
- Added support for resuming from a specific checkpoint via `resume_from_checkpoint` argument (#516), see the combined sketch below
- Added support for `ReduceLROnPlateau` scheduler (#320)
- Added support for Apex mode `O2` in conjunction with Data Parallel (#493)
- Added option (`save_top_k`) to save the top k models in the `ModelCheckpoint` class (#128)
- Added `on_train_start` and `on_train_end` hooks to `ModelHooks` (#598)
- Added `TensorBoardLogger` (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added `map_location` argument to `load_from_metrics` and `load_from_checkpoint` (#625)
- Added option to disable validation by setting `val_percent_check=0` (#649)
- Added `NeptuneLogger` class (#648)
- Added `WandbLogger` class (#627)
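A few of these combined into one hedged sketch: resuming from a checkpoint (#516), keeping the top k checkpoints (#128), and using a `ReduceLROnPlateau` scheduler (#320). Paths, hyperparameters, and module internals are illustrative:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# keep only the 3 best checkpoints according to the monitored metric
checkpoint_cb = ModelCheckpoint(filepath='checkpoints/', save_top_k=3)

trainer = pl.Trainer(
    checkpoint_callback=checkpoint_cb,
    resume_from_checkpoint='checkpoints/last.ckpt',  # illustrative path to resume from
)

# inside a LightningModule: ReduceLROnPlateau scheduler support
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
    return [optimizer], [scheduler]
```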
Changed
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed `step_idx` to `step`, `epoch_idx` to `epoch`, `max_num_epochs` to `max_epochs` and `min_num_epochs` to `min_epochs` (#589)
- Renamed several `Trainer` attributes (#567):
  - `total_batch_nb` to `total_batches`
  - `nb_val_batches` to `num_val_batches`
  - `nb_training_batches` to `num_training_batches`
  - `max_nb_epochs` to `max_epochs`
  - `min_nb_epochs` to `min_epochs`
  - `nb_test_batches` to `num_test_batches`
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to `TensorBoardLogger` (#609)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
Deprecated
- Deprecated `max_nb_epochs` and `min_nb_epochs` (#567)
- Deprecated the `on_sanity_check_start` hook in `ModelHooks` (#598)
Removed
- Removed the `save_best_only` argument from `ModelCheckpoint`, use `save_top_k=1` instead (#128)
Fixed
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting `gpus=0` or `gpus=[]` (#561)
- Fixed an error with `print_nan_gradients` when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a `val_check_interval < 1.0` in `Trainer` (#492)
- Fixed bugs relating to the `CometLogger` object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning `-1` from `on_batch_start` following an early exit or when the batch was `None` (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using `truncated_bptt > 1` (#532)
- Fixed a bug when using `IterableDataset` (#547)
- Fixed a bug where `.item` was called on non-tensor objects (#602)
- Fixed a bug where `Trainer.train` would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at `max_epochs` (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where `num_training_batches` and `num_test_batches` would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting `num_training_batches` (#653)
- Fixed a bug when batches did not have a `.copy` method (#701)
- Fixed a bug when using `log_gpu_memory=True` in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where `on_train_end` was not called when early stopping (#723)
Contributors
@akhti, @alumae, @awaelchli, @Borda, @borisdayma, @ctlaltdefeat, @dreamgonfly, @elliotwaite, @fdiehl, @goodok, @haossr, @HarshSharma12, @Ir1d, @jakubczakon, @jeffling, @kuynzereb, @MartinPernus, @matthew-z, @MikeScarp, @mpariente, @neggert, @rwesterman, @ryanwongsa, @schwobr, @tullie, @vikmary, @VSJMilewski, @williamFalcon, @YehCF
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Generalization!
Generalization release
The main focus of this release was on adding flexibility and generalization to support broad research cases.
Next release will be Dec 7th (every 30 days).
Internal Facebook support
@lorenzoFabbri @tullie @myleott @ashwinb @shootingsoul @vreis
These features were added to support FAIR, FAIAR and broader ML across other FB teams.
In general, we can expose any part that isn't exposed yet where someone might want to override the lightning implementation.
- Added truncated back propagation through time support (thanks @tullie).

```python
Trainer(truncated_bptt_steps=2)
```

- Added iterable datasets.

```python
# return an IterableDataset
def train_dataloader(...):
    ds = IterableDataset(...)
    return DataLoader(ds)

# set validation to a fixed number of batches
# (checks val every 100 train batches)
Trainer(val_check_interval=100)
```
- Add ability to customize backward and other training parts:

```python
def backward(self, use_amp, loss, optimizer):
    """
    Override backward with your own implementation if you need to

    :param use_amp: Whether amp was requested or not
    :param loss: Loss is already scaled by accumulated grads
    :param optimizer: Current optimizer being used
    :return:
    """
    if use_amp:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```
- DDP custom implementation support (override these hooks):

```python
def configure_ddp(self, model, device_ids):
    """
    Override to init DDP in a different way or use your own wrapper.
    Must return model.

    :param model:
    :param device_ids:
    :return: DDP wrapped model
    """
    model = LightningDistributedDataParallel(
        model,
        device_ids=device_ids,
        find_unused_parameters=True
    )
    return model

def init_ddp_connection(self, proc_rank, world_size):
    """
    Connect all procs in the world using the env:// init
    Use the first node as the root address
    """
    # use slurm job id for the port number
    # guarantees unique ports across jobs from same grid search
    try:
        # use the last 4 numbers in the job id as the id
        default_port = os.environ['SLURM_JOB_ID']
        default_port = default_port[-4:]

        # all ports should be in the 10k+ range
        default_port = int(default_port) + 15000
    except Exception:
        default_port = 12910

    # if user gave a port number, use that one instead
    try:
        default_port = os.environ['MASTER_PORT']
    except Exception:
        os.environ['MASTER_PORT'] = str(default_port)

    # figure out the root node addr
    try:
        root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
    except Exception:
        root_node = '127.0.0.2'

    root_node = self.trainer.resolve_root_node_address(root_node)
    os.environ['MASTER_ADDR'] = root_node
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
```
- Support for your own apex init or implementation.

```python
def configure_apex(self, amp, model, optimizers, amp_level):
    """
    Override to init AMP your own way.
    Must return a model and list of optimizers.

    :param amp:
    :param model:
    :param optimizers:
    :param amp_level:
    :return: Apex wrapped model and optimizers
    """
    model, optimizers = amp.initialize(
        model, optimizers, opt_level=amp_level,
    )
    return model, optimizers
```
- DDP2 implementation (inspired by parlai and @stephenroller).
  DDP2 acts as DP within a node and DDP across nodes.
  As a result, an optional method is introduced, `training_end`, where you can use the outputs of `training_step` (performed on each GPU with a portion of the batch) to do something with the outputs of all batches on the node (i.e., negative sampling).

```python
Trainer(distributed_backend='ddp2')

def training_step(...):
    # x is 1/nb_gpus of the full batch
    out = model(x)
    return {'out': out}

def training_end(self, outputs):
    # all_outs has outs from ALL gpus
    all_outs = outputs['out']
    loss = softmax(all_outs)
    return {'loss': loss}
```
Logging
- More logger diversity including Comet.ml.
- Versioned logs for all loggers.
- Switched from print to logging.
Progress bar
- Now the progress bar has a full bar for the full train + val epochs and a second bar visible only during val.
Loading
- Checkpoints now store hparams.
- No need to pass tags.csv to restore state because it lives in the checkpoint.
Slurm resubmit with apex + DDP
- Fixes issue of DDP restore weights blowing out GPU memory (load on CPU first, then GPU).
- Saves apex state automatically and restores it from a checkpoint.
Refactoring
- Internal code made modular through Mixins for ease of readability and to minimize merge conflicts.
Docs
- Tons of doc improvements.
Thanks!
Thank you to the amazing contributor community! Especially @neggert and @Borda for reviewing PRs and taking care of a good number of Github issues. The community is thriving and has really embraced making Lightning better.
Great job everyone!
Simpler interface, new features
0.5.1
Simpler interface
All trainers now have a default logger, early stopping and checkpoint object. To modify the behavior, pass in your own versions of those.
- Removed collisions with logger versions by tying it to job id.
Features
- Added a new DDP implementation (`Trainer(distributed_backend='ddp2')`). It uses DP within a node but allows multiple nodes. Useful for models which need negative samples, etc...
- Support for LBFGS. If you pass in LBFGS, Lightning handles the closure for you automatically (see the sketch below).
- No longer need to set the master port; Lightning does it for you using the job id.
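For the LBFGS support, the only user-side change is returning the optimizer from `configure_optimizers`; Lightning builds and passes the required closure internally. A minimal hedged sketch:

```python
import torch

# inside a LightningModule: Lightning supplies the closure that LBFGS requires
def configure_optimizers(self):
    return torch.optim.LBFGS(self.parameters(), lr=0.1)
```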
Minor changes
- `training_step` and `validation_end` now return two separate dicts, one for the progress bar and one for logging (see the sketch below).
- Added options to memory printing: 'min_max' logs only the max/min memory use; 'all' logs all the GPUs on the root node.
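A hedged illustration of the two-dict return; the key names shown ('progress_bar' and 'log') follow the convention documented in later releases and may have differed slightly at this version, and `loss_fn` is an illustrative attribute:

```python
def training_step(self, batch, batch_nb):
    x, y = batch
    loss = self.loss_fn(self(x), y)
    return {
        'loss': loss,
        'progress_bar': {'train_loss': loss},  # values shown in the progress bar
        'log': {'train_loss': loss},           # values sent to the logger
    }
```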
API clean up
This release has breaking API changes. See #124 for all details.
Syntax changes are:
- in trainer options use: `train`, `test`, `val`
- for data: `val_dataloader`, `test_dataloader`, `train_dataloader`
- `data_batch` -> `batch`
- `prog` -> `progress`
- `gradient_clip` -> `gradient_clip_val`
- `add_log_row_interval` -> `row_log_interval`
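A hedged before/after sketch for the renamed Trainer arguments (values are illustrative):

```python
from pytorch_lightning import Trainer

# before:
# trainer = Trainer(gradient_clip=0.5, add_log_row_interval=10)

# after:
trainer = Trainer(gradient_clip_val=0.5, row_log_interval=10)
```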
Various DDP improvements
This release does the following:
- Moves SLURM resubmit from test-tube to PL (which removes the need for cluster parameter).
- Cluster checkpoint done by Lightning now (not test-tube). Also doesn't require a checkpoint object to restore weights when on cluster.
- Loads all models on CPU when restoring weights to avoid OOM issues in PyTorch. Users restoring weights by hand now need to move the model to GPU manually; when using Lightning, it will move the model to the correct GPUs automatically.
- Fixes various subtle bugs in the DDP implementation.
- Documentation updates.
New features
- `validation_step` and `val_dataloader` are now optional.
- Enabled multiple dataloaders for validation (see the sketch below).
- Support for the latest test-tube logger, optimized for PyTorch 1.2.0.
- `lr_scheduler` is now activated after each epoch.
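A hedged sketch of returning multiple validation dataloaders; the dataset attributes are illustrative, and in this era dataloader hooks were typically wrapped with the `@data_loader` decorator (omitted here). Lightning then runs validation over each loader in turn:

```python
from torch.utils.data import DataLoader

# inside a LightningModule: validation_step and val_dataloader are now optional
def val_dataloader(self):
    # returning a list enables multiple validation dataloaders
    return [
        DataLoader(self.val_set_a, batch_size=32),
        DataLoader(self.val_set_b, batch_size=32),
    ]
```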