CoDi is an accessible and user-friendly REST microservice that automates the disentanglement of a set of instant messages into conversations by leveraging state-of-the-art machine learning algorithms.
To run the webserver and correctly disentangle conversations, you will need the following:
- Python (v3.10)
- OCaml (v4.12.0) (optional; only needed to compile MEGAM, see below)
This project was developed in a custom Conda environment. To recreate it, execute the following commands:

```shell
conda env create -f ./environment.yml
conda activate codi
```
To deactivate and delete the environment, execute the following commands:

```shell
conda deactivate
conda remove --name codi --all
```
To run the webserver locally, you first need to generate a Django secret key, which can be done with the following command (from the root of this repository):

```shell
python -c 'from django.core.management.utils import get_random_secret_key; print(get_random_secret_key())'
```
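If Django is not yet installed in your active environment, a key of the same shape can be produced with the standard library alone. This is a hedged sketch, not Django's own helper: it assumes the convention of sampling 50 characters from lowercase letters, digits, and a small punctuation set.

```python
import secrets
import string

# Character set assumed to mirror Django's get_random_secret_key convention:
# lowercase letters, digits, and a small punctuation set.
ALPHABET = string.ascii_lowercase + string.digits + "!@#$%^&*(-_=+)"

def generate_secret_key(length: int = 50) -> str:
    """Return a cryptographically random key of the given length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_secret_key())
```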
Once the key has been generated, create a `.env` file in the repository's root. The `.env` file must be structured as follows:

```
DJANGO_DEBUG=True
DJANGO_SECRET_KEY="your-key"
```
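Both values arrive in the settings module as plain strings. The sketch below shows how such variables are typically consumed; how CoDi's `settings.py` actually reads them, and the mechanism that loads `.env` (e.g., python-dotenv), are assumptions, not part of this repository's documented setup.

```python
import os

def env_bool(name: str, default: str = "False") -> bool:
    """Environment variables are plain strings, so the string 'False'
    would be truthy under bool(); compare to the literal 'True' instead."""
    return os.environ.get(name, default) == "True"

# Variable names taken from the .env file above.
SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "")
DEBUG = env_bool("DJANGO_DEBUG")
```

The explicit string comparison matters when `DJANGO_DEBUG` is set to `False`, which must actually disable debug mode rather than evaluate as a truthy non-empty string.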
We include the latest version of the MEGAM Max Entropy Classifier by Hal Daumé III. To compile it, run the following commands:

```shell
cd codi/api/utils/megam_0.92
make clean
make depend
make
```
You can start a local instance of CoDi with the provided `run_server.sh` script:

```shell
cd scripts
chmod +x ./run_server.sh
./run_server.sh
```
Run and debug configurations are available in the `.run` directory for users who have PyCharm Professional. To use them, open this repository in PyCharm; it will automatically import the configurations for you.
First, navigate to the Django configuration "Run server" > Environment > Environment variables. Here you need to add an environment variable with key `DJANGO_SECRET_KEY` and value `your-key` (the same key you generated earlier), and another with key `DJANGO_DEBUG` and value `True`.
The compound configuration "Run server" will compile the MEGAM binary and run the server. If you only need to run the server, you can use the Django configuration "Start server".
We also offer a Docker image. The Dockerfile and docker-compose file can be found in the project's root directory. The image can also be built using the Docker configuration "Compose" (for PyCharm Professional users). Before running this configuration, ensure that you have a file named `.env.production` in the repo's root directory. The file needs to have the same structure as the `.env` file described earlier. In this case, we recommend setting the `DJANGO_DEBUG` variable to `False`.
The `datasets` directory contains some example datasets (in ANNOT or JSON format) taken from previously published papers. `datasets/annot/from_previous_papers` and `datasets/json/from_previous_papers` include datasets previously published in:
- Elsner, M., & Charniak, E. (2010). Disentangling chat. Computational Linguistics, 36(3), pp. 389-409, ACL, 2010.
- Chatterjee, P., Damevski, K., Kraft, N. A., & Pollock, L. (2020). Software-related Slack chats with disentangled conversations. In Proceedings of MSR 2020 (International Conference on Mining Software Repositories), pp. 588-592, ACM, 2020.
- Subash, K. M., Kumar, L. P., Vadlamani, S. L., Chatterjee, P., & Baysal, O. (2022). DISCO: A Dataset of Discord Chat Conversations for Software Engineering Research. In Proceedings of MSR 2022 (International Conference on Mining Software Repositories), ACM, 2022.
On the Homepage of CoDi you can train the model from scratch, then disentangle and visualize an example dataset:
- Check that the Train operation is selected (it is selected by default)
- Select Slack as the type of platform you want to train on
- Leave all the features enabled
- Drag and drop the training set `datasets/annot/from_previous_papers/training.annot` into the page
- Wait for the operation to complete (console debug information will inform you about its progress)
- When the Train operation has completed, you can select a Predict operation and the Discord platform, then drag and drop one of the Discord datasets to perform a disentanglement prediction (e.g., `datasets/annot/from_previous_papers/clojure_Feb2020-Apr2020.annot`)
- After a successful validation (try, for example, `datasets/json/from_previous_papers/validation.json`), you can also check the Statistics box for information about disentanglement performance (e.g., Accuracy, F1-score)
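As a reminder of what the reported metrics mean, the F1-score is the harmonic mean of precision and recall. The counts below are purely illustrative and not taken from CoDi's output:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from raw counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 8 correct predictions, 2 spurious, 4 missed.
print(round(f1_score(8, 2, 4), 3))  # → 0.727
```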
CoDi was presented and used in the following scientific research papers:
- Riggio, E., Raglianti, M., & Lanza, M. (2023). Conversation Disentanglement As-a-Service. Proceedings of ICPC 2023 (International Conference on Program Comprehension), in press, IEEE.
Copyright (c) 2023 Edoardo Riggio, Marco Raglianti, Michele Lanza, REVEAL @ Software Institute - USI, Lugano, Switzerland
Distributed under the MIT License. See LICENSE for more information.
- REVEAL - https://reveal.si.usi.ch