DSIR large-scale data selection framework for language model training
Updated Apr 7, 2024 - Python
GUNDAM is a data management system that prioritizes data using language models.
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
Framework for processing and filtering datasets
[ACL 2025 main] SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models
This repository contains all (Python 3) code and libraries required for the 2022-2023 Notre Dame Rocketry Team (NDRT) Apogee Control System (ACS). It also contains sensor/actuator example code and flight data.
Base-call error-filtering and read preprocessing pipeline for fastq libraries
Anonymises data in text files and sheet files. It recognises and removes various kinds of personally identifiable information (PII), replacing each removed part with suitable generic text that depends on the type of data removed. English and Russian are currently supported; Russian works with both Cyrillic and Latin characters.
A powerful tool that allows users to query JSON data using SQL-like syntax. Effortlessly search, filter, and manipulate your JSON data with familiar SQL queries.
🤖Ngram Similarity Engine📚
A powerful, interactive desktop dashboard built with PyQt5, Matplotlib, Seaborn, Plotly, and scikit-learn. Designed for data wrangling, visualization, and machine learning—all in one elegant dark-themed GUI.
This Python script filters out incorrectly formatted lines in the `lottery_numbers.csv` file and saves only the valid ones in `correct_numbers.csv`.
Drawer automates single-elimination draw systems, ensuring fairness with balanced group allocation and bias-free brackets. Now enhanced with Docker, it eliminates dependency issues for seamless event management.
Data exploration project from the Udacity Data Analysis Nanodegree
This is an interactive Streamlit dashboard designed to visualize and analyze business data such as employee salaries, departmental distribution, and demographic statistics. It integrates with a MySQL database and offers real-time filtering and graphical insights.
A Python script to filter and extract information from GTF files based on chromosome names, designed to be easily accessible for biologists without extensive programming experience.
Scripts to make life easier and more organized
Details the data modeling techniques used, the functionality of the output, and how a plan finder works based on user inputs.
Filter differentially expressed (DE) genes based on log2FoldChange, FDR value, or both
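As a rough illustration of the kind of filtering the last entry describes, here is a minimal sketch of thresholding genes on absolute log2 fold change, FDR, or both. It is not taken from that repository: the function name `filter_de_genes`, the dictionary keys `log2FoldChange` and `FDR`, the default cutoffs, and the sample records are all assumptions made for this example.

```python
def filter_de_genes(rows, min_abs_lfc=1.0, max_fdr=0.05, mode="both"):
    """Return the rows that pass the chosen filter.

    rows: iterable of dicts with (assumed) keys "log2FoldChange" and "FDR".
    mode: "lfc" (fold change only), "fdr" (FDR only), or "both".
    """
    if mode not in {"lfc", "fdr", "both"}:
        raise ValueError(f"unknown mode: {mode!r}")

    kept = []
    for row in rows:
        # Up- and down-regulated genes both count, so compare the magnitude.
        passes_lfc = abs(row["log2FoldChange"]) >= min_abs_lfc
        passes_fdr = row["FDR"] <= max_fdr

        if mode == "lfc":
            keep = passes_lfc
        elif mode == "fdr":
            keep = passes_fdr
        else:
            keep = passes_lfc and passes_fdr

        if keep:
            kept.append(row)
    return kept


# Made-up records, purely for illustration.
genes = [
    {"gene": "geneA", "log2FoldChange": 2.3, "FDR": 0.001},
    {"gene": "geneB", "log2FoldChange": -0.4, "FDR": 0.020},
    {"gene": "geneC", "log2FoldChange": -1.8, "FDR": 0.300},
]

significant = filter_de_genes(genes, mode="both")
```

With these sample records, `mode="both"` keeps only `geneA`, while `mode="lfc"` also keeps `geneC` and `mode="fdr"` also keeps `geneB`. Appropriate cutoffs depend on the experiment, so the defaults here should be treated as placeholders.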