This notebook outlines the preprocessing steps undertaken to generate datasets used for testing the liaisons client and benchmarking open-source models in predicting argument relations.
For this preprocessing task, we utilized IBM's "Claim Stance Dataset" as the source. This dataset comprises 2,394 labeled claims across 55 controversial topics, collected from Wikipedia. Each claim is labeled based on its stance towards the topic, either "PRO" (supporting the topic) or "CON" (opposing the topic), making it an excellent resource for relation-based argument mining tasks.
As of now, the dataset is available on HuggingFace. Further details on the original dataset can be found in the paper Stance Classification of Context-Dependent Claims (Bar-Haim et al., 2017).
The primary aim of this preprocessing is to create a representative sample of the dataset, roughly 100 entries, enabling benchmarking with limited computing resources. A secondary aim is to guard against misleading benchmarks: prior work in relation-based argument mining has shown that models can achieve satisfactory results only within specific domains (Gorur et al., 2024). To mitigate this, the preprocessing also rebalances the distribution of claims so that stances and topics are evenly represented.
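The balanced sampling step could be sketched as follows. This is a minimal illustration, not the exact preprocessing code: the column names `topic` and `stance` are assumptions about the dataset's schema, and the function simply draws an equal share from every (topic, stance) group.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, target_size: int = 100, seed: int = 42) -> pd.DataFrame:
    """Draw roughly `target_size` rows, balanced across (topic, stance) groups.

    Assumes `df` has `topic` and `stance` columns (hypothetical names).
    Each group contributes an equal share, capped by the group's size.
    """
    groups = df.groupby(["topic", "stance"])
    per_group = max(1, target_size // groups.ngroups)
    sampled = groups.apply(
        lambda g: g.sample(n=min(per_group, len(g)), random_state=seed)
    )
    return sampled.reset_index(drop=True)
```

With 55 topics and 2 stances this yields 110 groups, so a target of ~100 entries comes out to one claim per group; smaller toy data behaves proportionally.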
Processing results can be found on HuggingFace at coding-kelps/liaisons-claim-stance-sample.
As mentioned earlier, this work is part of an academic project toward the validation of my Master's degree at Heriot-Watt University, which prevents me from accepting any contributions until the final release of the project. Thank you for your understanding.
This work is part of a collection of works whose ultimate goal is to deliver a framework to automatically analyze social media content (e.g., X, Reddit) to extract their argumentative value and predict their relations, leveraging Large Language Models' (LLMs) abilities:
- liaisons (the developed client for social media content analysis)
- liaisons-claim-stance-sample (the resulting sample of this preprocess)
- liaisons-experiments (the benchmarking framework to evaluate LLMs' relation prediction abilities)
- liaisons-experiments-results (the obtained results with this benchmarking)
- mantis-shrimp (the configuration-as-code used to set up my workstation for this project)
This project is solely conducted by me, Guilhem Santé. I am a postgraduate student pursuing the MSc in Artificial Intelligence at Heriot-Watt University in Edinburgh.
I would like to credit Andrew Ireland, my supervisor for this project.