..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Accelerate
=======================================================================================================================

Run your *raw* PyTorch training script on any kind of device

Features
-----------------------------------------------------------------------------------------------------------------------

- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed
  setting (multi-GPU, TPU, etc.) while still letting you write your own training loop. The same code can then run
  seamlessly on your local machine for debugging or in your training environment.

- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment, then
  launch your scripts.

Easy to integrate
-----------------------------------------------------------------------------------------------------------------------

A traditional training loop in PyTorch looks like this:

.. code-block:: python

    my_model.to(device)

    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        loss.backward()
        my_optimizer.step()

Changing it to work with 🤗 Accelerate is really easy and only adds a few lines of code:

.. code-block:: diff

    + from accelerate import Accelerator

    + accelerator = Accelerator()
      # Use the device given by the `accelerator` object.
    + device = accelerator.device
      my_model.to(device)
      # Pass every important object (model, optimizer, dataloader) to `accelerator.prepare`
    + my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
    +     my_model, my_optimizer, my_training_dataloader
    + )

      for batch in my_training_dataloader:
          my_optimizer.zero_grad()
          inputs, targets = batch
          inputs = inputs.to(device)
          targets = targets.to(device)
          outputs = my_model(inputs)
          loss = my_loss_function(outputs, targets)
          # Just a small change for the backward instruction
    -     loss.backward()
    +     accelerator.backward(loss)
          my_optimizer.step()

and with this, your script can now run in a distributed environment (multi-GPU, TPU).

You can even simplify your script a bit by letting 🤗 Accelerate handle the device placement for you (which is safer,
especially for TPU training):

.. code-block:: diff

    + from accelerate import Accelerator

    + accelerator = Accelerator()
    - my_model.to(device)
      # Pass every important object (model, optimizer, dataloader) to `accelerator.prepare`
    + my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
    +     my_model, my_optimizer, my_training_dataloader
    + )

      for batch in my_training_dataloader:
          my_optimizer.zero_grad()
          inputs, targets = batch
    -     inputs = inputs.to(device)
    -     targets = targets.to(device)
          outputs = my_model(inputs)
          loss = my_loss_function(outputs, targets)
          # Just a small change for the backward instruction
    -     loss.backward()
    +     accelerator.backward(loss)
          my_optimizer.step()
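Putting the pieces of that last diff together, a minimal end-to-end script could look like the sketch below. The tiny
linear model, random dataset, optimizer, and loss are placeholders (not part of the example above) standing in for your
own ``my_model``, ``my_optimizer``, ``my_training_dataloader``, and ``my_loss_function``:

.. code-block:: python

    # Minimal, self-contained sketch: the toy model, data, and loss are placeholders.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from accelerate import Accelerator

    accelerator = Accelerator()

    # Toy stand-ins for your real model, optimizer, dataloader, and loss function.
    my_model = torch.nn.Linear(10, 2)
    my_optimizer = torch.optim.SGD(my_model.parameters(), lr=1e-3)
    my_loss_function = torch.nn.CrossEntropyLoss()
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    my_training_dataloader = DataLoader(dataset, batch_size=8)

    # Let Accelerate handle device placement and any distributed wrapping.
    my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
        my_model, my_optimizer, my_training_dataloader
    )

    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        # `accelerator.backward` replaces `loss.backward()` and takes care of
        # gradient scaling when mixed precision is enabled.
        accelerator.backward(loss)
        my_optimizer.step()

Saved as, say, ``train.py`` (a hypothetical filename), this runs as a regular Python script on a single device and can
also be launched on your configured setup with the ``accelerate launch`` command described below.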
Script launcher
-----------------------------------------------------------------------------------------------------------------------

No need to remember how to use ``torch.distributed.launch`` or to write a specific launcher for TPU training! 🤗
Accelerate comes with a CLI tool that will make your life easier when launching distributed scripts.

On your machine(s) just run:

.. code-block:: bash

    accelerate config

and answer the questions asked. This will generate a config file that will be used automatically to properly set the
default options when doing

.. code-block:: bash

    accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the NLP example (from the root of the repo):

.. code-block:: bash

    accelerate launch examples/nlp_example.py

Supported integrations
-----------------------------------------------------------------------------------------------------------------------

- CPU only
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
- DeepSpeed (experimental support)

.. toctree::
    :maxdepth: 2
    :caption: Get started

    quicktour
    installation

.. toctree::
    :maxdepth: 2
    :caption: Guides

    sagemaker

.. toctree::
    :maxdepth: 2
    :caption: API reference

    accelerator
    launcher
    kwargs
    internal