collection of scripts for the preprocessing of dataset used for the "liaisons" project.

coding-kelps/liaisons-preprocess

⚠️ This repository is part of an academic project at Heriot-Watt University; no third-party contributions are accepted.

liaisons-preprocess

Overview

This notebook outlines the preprocessing steps undertaken to generate datasets used for testing the liaisons client and benchmarking open-source models in predicting argument relations.

Dataset

For this preprocessing task, we utilized IBM's "Claim Stance Dataset" as the source. This dataset comprises 2,394 labeled claims across 55 controversial topics, collected from Wikipedia. Each claim is labeled based on its stance towards the topic, either "PRO" (supporting the topic) or "CON" (opposing the topic), making it an excellent resource for relation-based argument mining tasks.

As of now, the dataset is available on HuggingFace. Further details on the original dataset can be found in the paper Stance Classification of Context-Dependent Claims (Bar-Haim et al., 2017).

Preprocessing Notes

The primary aim of this preprocessing is to create a representative sample of the dataset, roughly 100 entries, so that benchmarking remains feasible with limited computing resources. A second aim concerns benchmark validity: previous models in the field of relation-based argument mining have produced misleading benchmarks by achieving satisfactory results in specific domains (Gorur et al., 2024). To circumvent this, the preprocessing also rebalances the distribution of claims so that stances and topics are evenly represented.
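The balancing idea described above can be illustrated with a minimal sketch. This is not the notebook's actual code: the function name `balanced_sample` and the field names `topic` and `stance` are assumptions made for illustration, and the round-robin strategy is just one plausible way to spread roughly 100 entries evenly across topic and stance combinations.

```python
import random
from collections import defaultdict


def balanced_sample(rows, target_size=100, seed=0):
    """Pick up to target_size rows, spread evenly over (topic, stance) groups.

    `rows` is a list of dicts; the keys "topic" and "stance" are hypothetical
    names chosen for this sketch, not taken from the original dataset schema.
    """
    rng = random.Random(seed)

    # Group the claims by their (topic, stance) combination.
    groups = defaultdict(list)
    for row in rows:
        groups[(row["topic"], row["stance"])].append(row)

    # Shuffle inside each group so picks are random but reproducible.
    for members in groups.values():
        rng.shuffle(members)

    # Visit the groups in a random (but seeded) order.
    keys = sorted(groups)
    rng.shuffle(keys)

    # Round-robin: take one claim per group per pass until the target is met
    # or every group is exhausted.
    sample = []
    while len(sample) < target_size and any(groups[k] for k in keys):
        for key in keys:
            if groups[key] and len(sample) < target_size:
                sample.append(groups[key].pop())
    return sample
```

Because each pass takes at most one claim from every group, over-represented topics or stances cannot dominate the sample; small groups simply run out early and the remaining slots go to the groups that still have claims.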

Results

Processing results can be found on HuggingFace at coding-kelps/liaisons-claim-stance-sample.

About Contributions

As mentioned earlier, this work is part of an academic project for the validation of my Master's Degree at Heriot-Watt University, which prevents me from accepting any contributions until the final release of my project. Thank you for your understanding.

Associated Works

This work is part of a collection of works whose ultimate goal is to deliver a framework that automatically analyzes social media content (e.g., X, Reddit) to extract its argumentative value and predict argument relations, leveraging the abilities of Large Language Models (LLMs).

About the Development Team

This project is solely conducted by me, Guilhem Santé. I am a postgraduate student pursuing the MSc in Artificial Intelligence at Heriot-Watt University in Edinburgh.

Special Thanks

I would like to credit Andrew Ireland, my supervisor for this project.
