
Question about Project Status and Potential Contributions #1

Open
Hannibal046 opened this issue Jan 18, 2025 · 6 comments


Hannibal046 commented Jan 18, 2025

Hi Team,

First, I want to express my appreciation for maintaining this repository and fla. I'm finding both projects very valuable.

I have several questions about the project:

  1. Project Status

    • Is this repository actively being developed?
    • Are you accepting external contributions?
  2. Development Direction

    • If external contributions are welcome, could you share your roadmap?
    • I'd be interested in contributing based on the project's goals.
  3. Technical Architecture
    From my understanding:

    • The project uses fla for model definitions
    • Training is handled by torchtitan
    • Given fla's Hugging Face compatibility, evaluation should work with lm-eval-harness (a rough sketch of what I have in mind follows this list)

    Could you confirm whether this understanding is correct?
  4. Future Plans

    • Are there plans to extend into post-training scenarios?
    • If so, open-instruct could be a valuable reference point.
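
To make point 3 concrete, here's the kind of evaluation flow I have in mind. This is a rough, untested sketch: the checkpoint path is a placeholder, and I'm assuming that importing fla registers its model classes with transformers' Auto classes.

```python
# Rough, untested sketch of evaluating an HF-compatible fla checkpoint with
# lm-evaluation-harness. The checkpoint path is a placeholder, and importing
# `fla` is assumed to register its architectures with transformers.
import fla  # noqa: F401
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face model backend
    model_args="pretrained=path/to/fla-checkpoint,dtype=bfloat16",
    tasks=["lambada_openai", "piqa"],
    batch_size=8,
)
print(results["results"])
```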

Looking forward to your response and potentially contributing to the project.

Best regards

@yzhangcs
Member

@Hannibal046 Hi, yes, your understanding is correct.
This project is actively being developed, and we're continuously adding more features to flame. For example:

  1. While torchtitan currently only supports 4D parallelism for Llama, we aim to provide comprehensive support for all fla models.
  2. We're implementing support for online data tokenization with shuffling, which is currently lacking in torchtitan.

Regarding post-training, I don't have extensive experience in this field yet. However, I'd be very glad if you could contribute in this area. I also plan to add support for post-training features in the future.
In short, flame is a framework closely integrated with fla and transformers, with the ambition to operate at much larger scale.


Hannibal046 commented Jan 18, 2025

@yzhangcs
Hi, thanks for the quick reply!

If you're planning to implement support for online data tokenization with shuffling, I'd like to share an elegant implementation from Meta Lingua for your reference. Their approach:

  1. Pre-shuffles data;
  2. Accepts JSON Lines (JSONL) as input and performs online tokenization and reshuffling with a buffer;
  3. Easily controls the mixing ratio of different data sources.

I'm not sure which specific features you plan to implement, but relying solely on point 2 (online tokenization and reshuffling with a buffer) might not be sufficient for large-scale training: some datasets on Hugging Face are chronologically ordered, so even with a large online buffer the data would still be biased.
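
To make this concrete, here's a minimal, untested sketch of that kind of pipeline (file names, mixing weights, and the tokenizer are placeholders): JSONL sources are mixed by ratio, tokenized on the fly, and reshuffled with a fixed-size buffer.

```python
# Minimal, untested sketch of buffered online tokenization + reshuffling over
# JSONL sources. File names, mixing weights, and the tokenizer are placeholders.
import json
import random
from transformers import AutoTokenizer

def read_jsonl(path):
    """Yield the "text" field of each JSON line in `path`."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)["text"]

def mix_sources(paths, weights, seed=0):
    """Sample the next document from one of several JSONL sources by weight."""
    rng = random.Random(seed)
    iters, weights = [read_jsonl(p) for p in paths], list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:            # drop exhausted sources
            del iters[i], weights[i]

def tokenize_and_shuffle(docs, tokenizer, buffer_size=10_000, seed=0):
    """Tokenize on the fly and reshuffle with a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for text in docs:
        buffer.append(tokenizer(text)["input_ids"])
        if len(buffer) >= buffer_size:
            j = rng.randrange(len(buffer))
            buffer[j], buffer[-1] = buffer[-1], buffer[j]
            yield buffer.pop()           # emit a random element, keep the rest
    rng.shuffle(buffer)
    yield from buffer                    # drain the buffer at the end

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
stream = tokenize_and_shuffle(
    mix_sources(["source_a.jsonl", "source_b.jsonl"], weights=[0.7, 0.3]),
    tokenizer,
)
```

That said, a buffer like this only decorrelates documents locally, which is why pre-shuffling chronologically ordered datasets still matters.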

I'm happy to help if you need any assistance!

@yzhangcs
Member

Thank you! I will be taking a look at it.


wconstab commented Feb 4, 2025

I'd also like to ask whether your experience extending TorchTitan was smooth, and whether you have any suggestions or requests for extensibility features (e.g., making TorchTitan easier to build on). (I'm a TorchTitan developer.)


yzhangcs commented Feb 4, 2025

Hey @wconstab, thank you for developing this fantastic framework!
We've migrated to torchtitan very smoothly, and the training speed is impressively fast! Switching to torchtitan was definitely one of my wisest decisions in 2024. :-)

flame is built on torchtitan with some modifications, both current and planned:

  1. We're more familiar with HF-style model definitions, as the entire fla ecosystem is built on HF. Our biggest change is adapting torchtitan to support HF model definitions and initializations.
  2. We've revised the process for online tokenization to support stateful data shuffling. The code is simple and neat IMO: https://github.com/fla-org/flame/blob/main/flame/data.py#L230 (a stripped-down sketch of the idea follows this list). We're planning to submit a PR to the HF datasets team, hoping that stateful shuffling for iterable datasets could be officially supported. While this might not be a major issue for very large datasets, as @tianyu-l mentioned in pytorch/torchtitan#635 (data shuffling), we still believe it's a valuable feature for training on medium-sized datasets.
  3. Another important data-related feature is not yet supported by torchtitan or flame: resuming training in environments with a varying number of GPUs. As I understand it, we can currently only resume training in distributed environments with a fixed number of GPUs/nodes?
  4. We're planning to support CP/TP/PP for fla soon. I've just looked through torchtitan's implementations for the Llama architecture, and they're very clean! I'm planning to extend your implementations to the fla library. I'll reach out if I have any questions.
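
Regarding point 2, here is a stripped-down illustration of the idea, not the actual flame/data.py code. The buffering itself is similar to the sketch shared above; the point here is that the buffer contents and RNG state are part of the checkpoint, so shuffling resumes deterministically after a restart (assuming the upstream iterable is also resumable).

```python
# Stripped-down illustration of stateful buffered shuffling (not the actual
# flame implementation): the buffer contents and RNG state are checkpointed,
# so the shuffling order can be reproduced exactly after resuming.
import random

class StatefulShuffleBuffer:
    def __init__(self, buffer_size=16_384, seed=42):
        self.buffer_size = buffer_size
        self.rng = random.Random(seed)
        self.buffer = []

    def shuffle(self, iterator):
        for example in iterator:
            self.buffer.append(example)
            if len(self.buffer) >= self.buffer_size:
                # Swap a random element to the end and emit it.
                j = self.rng.randrange(len(self.buffer))
                self.buffer[j], self.buffer[-1] = self.buffer[-1], self.buffer[j]
                yield self.buffer.pop()
        self.rng.shuffle(self.buffer)     # drain whatever remains at the end
        while self.buffer:
            yield self.buffer.pop()

    def state_dict(self):
        # Everything needed to resume the exact shuffling order.
        return {"buffer": list(self.buffer), "rng_state": self.rng.getstate()}

    def load_state_dict(self, state):
        self.buffer = list(state["buffer"])
        self.rng.setstate(state["rng_state"])
```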

The goal of flame is to develop a minimal yet complete framework for fla, so we're very pleased when features we desire are officially supported. We've already noticed that some of these features are on the roadmap for torchtitan. I hope my feedback provides some useful additional insights.


tianyu-l commented Feb 6, 2025

@yzhangcs Thank you for the nice feedback!

> We're planning to submit a PR to the HF datasets team, hoping that stateful shuffling for iterable datasets could be officially supported.

torchtitan currently depends on HF datasets. If your solution is upstreamed to HF, we are more than happy to adapt to support stateful shuffling. As you have noted, the feature implementation itself may be outside the scope of torchtitan.

> As I understand it, we can currently only resume training in distributed environments with a fixed number of GPUs/nodes?

In fact, DCP in general supports resharding (a varying number of GPUs, or varying parallelisms) pretty well. It's the data loader that makes resuming after resharding non-trivial. However, if you don't need to load data in the same way (e.g., when continuing training on a new dataset), we currently have a PR to optionally skip loading the data loader checkpoint. See pytorch/torchtitan#819
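
To illustrate the mechanics (an illustrative sketch with made-up names, not the exact torchtitan code): DCP only loads the keys present in the state dict you pass it, so skipping the data loader state is just a matter of leaving that entry out when loading.

```python
# Illustrative sketch, not the exact torchtitan code. DCP loads only the keys
# present in the state_dict it is given, so a trainer can skip restoring the
# data loader simply by leaving it out.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.stateful import Stateful

def resume(trainer_state: dict[str, Stateful], checkpoint_id: str,
           load_dataloader: bool = True) -> None:
    # trainer_state might look like:
    # {"model": model_wrapper, "optimizer": optim_wrapper, "dataloader": dl_wrapper}
    to_load = dict(trainer_state)
    if not load_dataloader:
        to_load.pop("dataloader", None)   # e.g. when continuing on a new dataset
    # Resharding across a different number of GPUs / parallelisms is handled by DCP.
    dcp.load(state_dict=to_load, checkpoint_id=checkpoint_id)
```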

> I'm planning to extend your implementations to the fla library. I'll reach out if I have any questions.

Any time.
