
Question about Project Status and Potential Contributions #1

Open
Hannibal046 opened this issue Jan 18, 2025 · 6 comments


Hannibal046 commented Jan 18, 2025

Hi Team,

First, I want to express my appreciation for maintaining this repository and fla. I'm finding both projects very valuable.

I have several questions about the project:

  1. Project Status

    • Is this repository actively being developed?
    • Are you accepting external contributions?
  2. Development Direction

    • If external contributions are welcome, could you share your roadmap?
    • I'd be interested in contributing based on the project's goals.
  3. Technical Architecture
    From my understanding:

    • The project uses fla for model definitions
    • Training is handled by torchtitan
    • Given fla's Hugging Face compatibility, evaluation should work with lm-eval-harness (a rough sketch of what I have in mind follows this list)

    Could you confirm whether this understanding is correct?
  4. Future Plans

    • Are there plans to extend into post-training scenarios?
    • If so, open-instruct could be a valuable reference point.
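
To make point 3 concrete, here's the kind of evaluation flow I have in mind. This is a rough, untested sketch: the checkpoint path is a placeholder, and I'm assuming that importing fla registers its model classes with transformers' Auto classes.

```python
# Rough, untested sketch of evaluating an HF-compatible fla checkpoint with
# lm-evaluation-harness. The checkpoint path is a placeholder, and importing
# `fla` is assumed to register its architectures with transformers.
import fla  # noqa: F401
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face model backend
    model_args="pretrained=path/to/fla-checkpoint,dtype=bfloat16",
    tasks=["lambada_openai", "piqa"],
    batch_size=8,
)
print(results["results"])
```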

Looking forward to your response and potentially contributing to the project.

Best regards

@yzhangcs
Member

@Hannibal046 Hi, yes, your understanding is correct.
This project is actively being developed, and we're continuously adding more features to flame. For example:

  1. While torchtitan currently only supports 4D parallelism for Llama, we aim to provide comprehensive support for all fla models.
  2. We're implementing support for online data tokenization with shuffling, which is currently lacking in torchtitan.

Regarding post-training, I don't have extensive experience in this field yet. However, I'd be very glad if you could contribute in this area. I also plan to add support for post-training features in the future.
In short, flame is a framework closely integrated with fla and transformers, with the ambition to operate at much larger scale.


Hannibal046 commented Jan 18, 2025

@yzhangcs
Hi, thanks for the quick reply!

If you're planning to implement support for online data tokenization with shuffling, I'd like to share an elegant implementation from Meta Lingua for your reference. Their approach:

  1. Pre-shuffles data;
  2. Accepts JSON Lines (JSONL) as input and performs online tokenization and reshuffling with a buffer;
  3. Easily controls the mixing ratio of different data sources.

I'm not sure which specific features you plan to implement, but relying solely on point 2 (online tokenization and reshuffling with a buffer) might not be sufficient for large-scale training: some datasets on Hugging Face are chronologically ordered, so even with a large online buffer the data would still be biased.
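
To make this concrete, here's a minimal, untested sketch of that kind of pipeline (file names, mixing weights, and the tokenizer are placeholders): JSONL sources are mixed by ratio, tokenized on the fly, and reshuffled with a fixed-size buffer.

```python
# Minimal, untested sketch of buffered online tokenization + reshuffling over
# JSONL sources. File names, mixing weights, and the tokenizer are placeholders.
import json
import random
from transformers import AutoTokenizer

def read_jsonl(path):
    """Yield the "text" field of each JSON line in `path`."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)["text"]

def mix_sources(paths, weights, seed=0):
    """Sample the next document from one of several JSONL sources by weight."""
    rng = random.Random(seed)
    iters, weights = [read_jsonl(p) for p in paths], list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:            # drop exhausted sources
            del iters[i], weights[i]

def tokenize_and_shuffle(docs, tokenizer, buffer_size=10_000, seed=0):
    """Tokenize on the fly and reshuffle with a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for text in docs:
        buffer.append(tokenizer(text)["input_ids"])
        if len(buffer) >= buffer_size:
            j = rng.randrange(len(buffer))
            buffer[j], buffer[-1] = buffer[-1], buffer[j]
            yield buffer.pop()           # emit a random element, keep the rest
    rng.shuffle(buffer)
    yield from buffer                    # drain the buffer at the end

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
stream = tokenize_and_shuffle(
    mix_sources(["source_a.jsonl", "source_b.jsonl"], weights=[0.7, 0.3]),
    tokenizer,
)
```

That said, a buffer like this only decorrelates documents locally, which is why pre-shuffling chronologically ordered datasets still matters.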

I'm happy to help if you need any assistance!

@yzhangcs
Member

Thank you! I will be taking a look at it.


wconstab commented Feb 4, 2025

I'd also like to ask whether your experience extending TorchTitan was smooth, and whether you have any suggestions or requests for extensibility features (e.g., making TorchTitan easier to build on). (I'm a TorchTitan developer.)


yzhangcs commented Feb 4, 2025

Hey @wconstab, thank you for developing this fantastic framework!
We've migrated to torchtitan very smoothly, and the training speed is impressively fast! Switching to torchtitan was definitely one of my wisest decisions in 2024. :-)

flame is built on torchtitan with some modifications, both current and planned:

  1. We're more familiar with HF-style model definitions, as the entire fla ecosystem is built on HF. Our biggest change is adapting torchtitan to support HF model definitions and initializations.
  2. We've revised the process for online tokenization to support stateful data shuffling. The code is simple and neat IMO: https://github.com/fla-org/flame/blob/main/flame/data.py#L230 (a stripped-down sketch of the idea follows this list). We're planning to submit a PR to the HF datasets team, hoping that stateful shuffling for iterable datasets could be officially supported. While this might not be a major issue for very large datasets, as @tianyu-l mentioned in pytorch/torchtitan#635 (data shuffling), we still believe it's a valuable feature for training on medium-sized datasets.
  3. Another important data-related feature is not yet supported by torchtitan or flame: resuming training in environments with a varying number of GPUs. As I understand it, we can currently only resume training in distributed environments with a fixed number of GPUs/nodes?
  4. We're planning to support CP/TP/PP for fla soon. I've just looked through torchtitan's implementations for the Llama architecture, and they're very clean! I'm planning to extend your implementations to the fla library. I'll reach out if I have any questions.
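
Regarding point 2, here is a stripped-down illustration of the idea, not the actual flame/data.py code. The buffering itself is similar to the sketch shared above; the point here is that the buffer contents and RNG state are part of the checkpoint, so shuffling resumes deterministically after a restart (assuming the upstream iterable is also resumable).

```python
# Stripped-down illustration of stateful buffered shuffling (not the actual
# flame implementation): the buffer contents and RNG state are checkpointed,
# so the shuffling order can be reproduced exactly after resuming.
import random

class StatefulShuffleBuffer:
    def __init__(self, buffer_size=16_384, seed=42):
        self.buffer_size = buffer_size
        self.rng = random.Random(seed)
        self.buffer = []

    def shuffle(self, iterator):
        for example in iterator:
            self.buffer.append(example)
            if len(self.buffer) >= self.buffer_size:
                # Swap a random element to the end and emit it.
                j = self.rng.randrange(len(self.buffer))
                self.buffer[j], self.buffer[-1] = self.buffer[-1], self.buffer[j]
                yield self.buffer.pop()
        self.rng.shuffle(self.buffer)     # drain whatever remains at the end
        while self.buffer:
            yield self.buffer.pop()

    def state_dict(self):
        # Everything needed to resume the exact shuffling order.
        return {"buffer": list(self.buffer), "rng_state": self.rng.getstate()}

    def load_state_dict(self, state):
        self.buffer = list(state["buffer"])
        self.rng.setstate(state["rng_state"])
```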

The goal of flame is to develop a minimal yet complete framework for fla, so we're very pleased when features we desire are officially supported. We've already noticed that some of these features are on the roadmap for torchtitan. I hope my feedback provides some useful additional insights.


tianyu-l commented Feb 6, 2025

@yzhangcs Thank you for the nice feedback!

> We're planning to submit a PR to the HF datasets team, hoping that stateful shuffling for iterable datasets could be officially supported.

torchtitan currently depends on HF datasets. If your solution is upstreamed to HF, we are more than happy to adapt to support stateful shuffling. As you have noted, the feature implementation itself may be outside the scope of torchtitan.

> As I understand it, we can currently only resume training in distributed environments with a fixed number of GPUs/nodes?

In fact, DCP in general supports resharding (a varying number of GPUs, or varying parallelisms) pretty well. It's the data loader that makes resuming after resharding non-trivial. However, if you don't need to load data in the same way (e.g., when continuing training on a new dataset), we currently have a PR to optionally skip loading the data loader checkpoint. See pytorch/torchtitan#819
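
To illustrate the mechanics (an illustrative sketch with made-up names, not the exact torchtitan code): DCP only loads the keys present in the state dict you pass it, so skipping the data loader state is just a matter of leaving that entry out when loading.

```python
# Illustrative sketch, not the exact torchtitan code. DCP loads only the keys
# present in the state_dict it is given, so a trainer can skip restoring the
# data loader simply by leaving it out.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.stateful import Stateful

def resume(trainer_state: dict[str, Stateful], checkpoint_id: str,
           load_dataloader: bool = True) -> None:
    # trainer_state might look like:
    # {"model": model_wrapper, "optimizer": optim_wrapper, "dataloader": dl_wrapper}
    to_load = dict(trainer_state)
    if not load_dataloader:
        to_load.pop("dataloader", None)   # e.g. when continuing on a new dataset
    # Resharding across a different number of GPUs / parallelisms is handled by DCP.
    dcp.load(state_dict=to_load, checkpoint_id=checkpoint_id)
```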

> I'm planning to extend your implementations to the fla library. I'll reach out if I have any questions.

Any time.
