alea-institute/alea-preprocess

Accessible, efficient data preprocessing library for pretrain and SFT datasets, including KL3M

alea-preprocess

Description

Efficient, accessible preprocessing routines for preparing pretraining, SFT, and DPO training data.

This library is part of ALEA's open-source large language model training pipeline, which is used in the research and
development of the KL3M project.

Installation

Note that this project is a work in progress and relies on compiled Rust code. Until a stable release is available,
installing the package from the GitHub source is recommended.

The latest published release can also be installed from PyPI using pip:

pip install alea-preprocess

To build and install a development version, run the following command from a checkout of the repository:

poetry run maturin develop
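
After either install path, a quick way to confirm that the package landed in the active environment is to query its installed metadata. The snippet below is a minimal sketch using only the Python standard library. It assumes the distribution is registered under the name used in the pip command above (alea-preprocess); the helper name installed_version is hypothetical and not part of this library. For a development install, run it inside the same Poetry environment (for example via poetry run python).

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "alea-preprocess") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default name is an assumption taken from the pip install command;
    adjust it if the distribution is registered under a different name.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("alea-preprocess is not installed in this environment")
    else:
        print(f"alea-preprocess {version} is installed")
```

If the check reports that the package is missing after a development build, the build most likely targeted a different virtual environment than the one running the script.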

Examples

Example use cases are currently available under the tests/ directory.

Additional documentation and examples will be provided in the future.

License

This ALEA project is released under the MIT License. See the LICENSE file for details.

Support

If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.

Learn More

To learn more about ALEA and its software and research projects like KL3M, visit the ALEA website.

Languages

HTML 88.7%, Rust 9.4%, Python 1.8%, Dockerfile 0.1%, Shell 0.1%

Created September 25, 2024
Updated November 11, 2024