A robust, platform-agnostic, and highly efficient framework for converting old fixed-length files to future-proof targets suitable for analytics and data science.
The evolution project was created in response to the emerging need for a tool that can transform old fixed-length files into data formats that integrate seamlessly with the modern data analytics landscape, and that can do so fully automatically.
We utilize the native speed of Rust together with multithreading and SIMD techniques to efficiently transform your old fixed-length files (of any size!) to a more modern target. The only target currently implemented is Parquet, but we aim to add support for Delta, Iceberg, IndraDB, and more.
The project is structured as a monorepo which hosts all of the evolution framework components, which can be found under crates/ as their own modules. The modular monorepo design allows anyone to implement their own target converters that integrate seamlessly with the core framework's existing functionality.
The easiest way to install an evolution binary on your system with support for all implemented output targets is by using the Cargo package manager (which downloads it from this link). This binary can be found at examples/full in this repo.
cargo install evolution
(available features)
- mock
- nightly
Alternatively, you can build everything from source by cloning the repo and compiling using Cargo.
git clone https://github.com/firelink-data/evolution.git
cd evolution
cargo build --release
If you want to integrate any of the evolution crates in your own project, simply add them as dependencies to your project's Cargo.toml file like you would any other third-party dependency, as shown below.
[dependencies]
evolution-common = "1.2.0"
evolution-schema = "1.2.0"
To work with automatic file conversion you need a valid schema which specifies the structure of the source file you want to convert. A valid schema, in this context, is a JSON file which adheres to this template. If you are unsure whether your own schema file is valid according to the template, you can use this validator tool.
An example schema can be found here, and if you are unsure about valid values for datatypes, alignment modes, and padding symbols, please refer to the template which lists all valid values. For specifics on all the currently supported padding modes, characters, and default values, please see the padder crate (which we also maintain).
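To make the role of the schema more concrete, here is a minimal sketch of how a column layout can be used to slice one fixed-length row into values. It does not use the evolution crates or the real schema template: the `Column` struct, its fields (`name`, `offset`, `length`), and the `split_row` function are hypothetical names chosen for illustration, and the sketch assumes space padding only.

```rust
/// Hypothetical column description: a name plus a byte offset and length.
/// These fields are illustrative and not taken from the evolution schema template.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: &'static str,
    offset: usize,
    length: usize,
}

/// Slice one fixed-length row into (column name, value) pairs, trimming
/// whitespace padding. Returns `None` if a column falls outside the row
/// or does not land on a character boundary.
fn split_row<'a>(row: &'a str, columns: &[Column]) -> Option<Vec<(&'static str, &'a str)>> {
    columns
        .iter()
        .map(|c| {
            let end = c.offset.checked_add(c.length)?;
            row.get(c.offset..end).map(|value| (c.name, value.trim()))
        })
        .collect()
}

fn main() {
    // Assumed layout: 4 bytes of id, 8 bytes of name, 6 bytes of city.
    let columns = [
        Column { name: "id", offset: 0, length: 4 },
        Column { name: "name", offset: 4, length: 8 },
        Column { name: "city", offset: 12, length: 6 },
    ];
    let row = "0042Alice   Lund  ";

    match split_row(row, &columns) {
        Some(values) => {
            for (name, value) in values {
                println!("{name} = {value}");
            }
        }
        None => eprintln!("row does not match the column layout"),
    }
}
```

The real schema also describes datatypes, alignment modes, and padding symbols, which this sketch leaves out.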
If you installed the program as explained above, simply running the binary prints the following usage message:
Efficiently evolve your old fixed-length data files into modern file formats.
Usage: evolution.exe [OPTIONS] <COMMAND>
Commands:
convert Convert a fixed-length file to another file format
mock Generate mocked fixed-length files
help Print this message or the help of the given subcommand(s)
Options:
-N, --n-threads <N_THREADS>
Enable multithreading and set the number of threads (logical cores) to use [default: 1]
-C, --thread-channel-capacity <THREAD_CHANNEL_CAPACITY>
The maximum capacity of the thread channel (in number of messages) [default: 32]
-R, --read-buffer-size <READ_BUFFER_SIZE>
The size of the read buffer used when converting (in bytes) [default: 5368709120]
-W, --write-buffer-size <WRITE_BUFFER_SIZE>
The size of the write buffer used when mocking (in rows) [default: 1000000]
-h, --help
Print help
-V, --version
Print version
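The thread channel capacity option is easier to reason about with a small example of a bounded channel. The sketch below uses only the Rust standard library and is not taken from the evolution source code; how evolution wires its threads internally is not described here, and `run_pipeline` is a hypothetical name. It shows the general idea that a producer blocks once the channel holds the maximum number of messages.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send `n_messages` chunks through a channel bounded to `capacity` messages
/// and return how many chunks the consumer received.
fn run_pipeline(capacity: usize, n_messages: usize) -> usize {
    let (sender, receiver) = sync_channel::<Vec<u8>>(capacity);

    let producer = thread::spawn(move || {
        for i in 0..n_messages {
            // `send` blocks while the channel already holds `capacity` messages.
            sender.send(vec![i as u8; 4]).expect("receiver hung up");
        }
        // `sender` is dropped here, which ends the receiver loop below.
    });

    // Consume until every sender has been dropped.
    let received = receiver.iter().count();
    producer.join().expect("producer panicked");
    received
}

fn main() {
    // 32 mirrors the default shown in the usage output above.
    println!("received {} messages", run_pipeline(32, 100));
}
```

A larger capacity lets the producer run further ahead of the consumer at the cost of holding more messages in memory.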
To specify the log verbosity, set the RUST_LOG environment variable to the desired value, e.g., INFO.
To find out how many threads (logical cores) you have available, run one of the following commands depending on your host system:
- Windows:
  - Command: Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors
  - Use the value found under NumberOfLogicalProcessors.
- Unix:
  - Command: lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
  - The number of logical cores is calculated as: threads per core × cores per socket × sockets.
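As a cross-check on the commands above, the same number can be estimated from a small Rust program. This is a sketch using the standard library only, not part of the evolution CLI; `logical_cores` and `detected_parallelism` are hypothetical helper names. Note that `std::thread::available_parallelism` reports the parallelism available to the process, which can be lower than the hardware count when CPU affinity or container limits apply.

```rust
use std::thread;

/// Logical cores computed from a CPU topology:
/// threads per core × cores per socket × sockets.
fn logical_cores(threads_per_core: usize, cores_per_socket: usize, sockets: usize) -> usize {
    threads_per_core * cores_per_socket * sockets
}

/// Ask the standard library how much parallelism is available,
/// falling back to 1 if it cannot be determined.
fn detected_parallelism() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    // Example topology: 2 threads per core, 8 cores per socket, 1 socket.
    println!("from topology: {}", logical_cores(2, 8, 1));
    println!("detected by std: {}", detected_parallelism());
}
```

Either value can then be passed to the -N, --n-threads option shown in the usage output.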
All code is copyright of firelink and published under a general MIT license. Please see LICENSE for specific information.