
Commit

Merge pull request #11 from SasCezar/dev
Bumped dependencies, updated UI
SasCezar authored Jan 27, 2024
2 parents 046b90d + 920c529 commit 5d230cd
Showing 2 changed files with 81 additions and 44 deletions.
99 changes: 68 additions & 31 deletions README.md
@@ -1,35 +1,47 @@
# AutoFL

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![DOI](https://zenodo.org/badge/644095707.svg)](https://zenodo.org/doi/10.5281/zenodo.10255367)
[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://img.shields.io/badge/Docker-blue)

Automatic source code file annotation using weak labelling.

## Setup
Clone the repository and the UI submodule [autofl-ui](https://github.com/SasCezar/autofl-ui) by running the following command:

Clone the repository and the UI submodule [autofl-ui](https://github.com/SasCezar/autofl-ui) by running the following
command:

```bash
git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL
```

### Optional Setup
### Optional Setup

To make use of certain features, such as semantic-based labelling functions, you need to download the corresponding model.
For example, for **w2v-so**, you can download the model from [here](https://github.com/vefstathiou/SO_word2vec), and place it in the [data/models/w2v-so](data/models/w2v-so) folder, or a custom
For example, for **w2v-so**, you can download the model from [here](https://github.com/vefstathiou/SO_word2vec), and
place it in the [data/models/w2v-so](data/models/w2v-so) folder, or a custom
path that you can use in the configs.

## Usage

Run docker the docker compose file [docker-compose.yaml](docker-compose.yaml) by executing:
Run docker compose in the project folder (where the [docker-compose.yaml](docker-compose.yaml) is located) by executing:

```shell
docker compose up
```
in the project folder.

### API Endpoint

You can analyze the files of a project by making a request to the endpoint:

```shell
curl -X POST -d '{"name": "<PROJECT_NAME>", "remote": "<PROJECT_REMOTE>", "languages": ["<PROGRAMMING_LANGUAGE>"]}' localhost:8000/label/files -H "content-type: application/json"
```
For example, to analyze the files of [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), you can make the following request:

For example, to analyze the files
of [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), you can make the following
request:

```shell
curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json"
```
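
The same request can also be built from Python. The following is a minimal sketch using only the standard library; the function names `build_label_request` and `send_label_request` are hypothetical helpers, and the sketch assumes the service is reachable at `localhost:8000` as in the `curl` example. The response format is not documented here, so it is returned as raw text.

```python
import json
from urllib import request


def build_label_request(name: str, remote: str, languages: list[str],
                        base_url: str = "http://localhost:8000") -> request.Request:
    """Build the POST request for the /label/files endpoint (hypothetical helper)."""
    payload = {"name": name, "remote": remote, "languages": languages}
    return request.Request(
        url=f"{base_url}/label/files",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_label_request(req: request.Request) -> str:
    """Send the request and return the raw response body as text.

    Requires the docker compose stack to be running.
    """
    with request.urlopen(req) as response:
        return response.read().decode("utf-8")
```

With the stack running, `send_label_request(build_label_request("pumpernickel", "https://github.com/mickleness/pumpernickel", ["java"]))` mirrors the `curl` call above.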
@@ -46,24 +58,30 @@ For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui).
[//]: # (For more details, check the [UI repo]&#40;https://github.com/SasCezar/autofl-ui&#41;)

## Configuration
AutoFL uses [Hydra](https://hydra.cc/) to manage the configuration. The configuration files are located in the [config](config) folder.

AutoFL uses [Hydra](https://hydra.cc/) to manage the configuration. The configuration files are located in
the [config](config) folder.
The main configuration file is [main.yaml](./config/main.yaml), which contains the following options:

- **local**: which environment to use, either local or docker. [Docker](./config/local/docker.yaml) is default.
- **local**: which environment to use, either local or docker. [Docker](./config/local/docker.yaml) is default.
- **taxonomy**: which taxonomy to use. Currently only [gitranking](./config/taxonomy/gitranking.yaml) is supported.
- **annotator**: which annotators to use. Default is [simple](./config/annotator/simple.yaml), which allows good results without extra dependencies on models.
- **version_strategy**: which version strategy to use. Default is [latest](./config/version_strategy/latest.yaml), which will only analyze the latest version of the project.
- **dataloader**: which dataloader to use. Default is [postgres](./config/dataloader/postgres.yaml) which allows the API to fetch already analysed projects.
- **writer**: which writer to use. Default is [postgres](./config/writer/postgres.yaml) which allows the API to store the results in a database.
- **annotator**: which annotators to use. Default is [simple](./config/annotator/simple.yaml), which allows good results
without extra dependencies on models.
- **version_strategy**: which version strategy to use. Default is [latest](./config/version_strategy/latest.yaml), which
will only analyze the latest version of the project.
- **dataloader**: which dataloader to use. Default is [postgres](./config/dataloader/postgres.yaml) which allows the API
to fetch already analysed projects.
- **writer**: which writer to use. Default is [postgres](./config/writer/postgres.yaml) which allows the API to store
the results in a database.

Other configurations can be defined by creating a new file in the folder of the specific component.
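
Since the options above are Hydra config groups, a non-default option can usually be selected with Hydra's `group=option` override syntax. The sketch below only renders such overrides from the defaults listed above; the helper `hydra_overrides` and the option name `my_annotator` are hypothetical, and the entry-point script that would receive these overrides is not specified in this README.

```python
# Defaults as described in the list above (config group -> default option).
DEFAULTS = {
    "local": "docker",
    "taxonomy": "gitranking",
    "annotator": "simple",
    "version_strategy": "latest",
    "dataloader": "postgres",
    "writer": "postgres",
}


def hydra_overrides(choices: dict[str, str]) -> list[str]:
    """Render non-default choices as Hydra `group=option` overrides (hypothetical helper)."""
    unknown = set(choices) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config group(s): {sorted(unknown)}")
    # Only groups whose option differs from the default need an override.
    return [f"{group}={option}" for group, option in choices.items()
            if DEFAULTS[group] != option]


# Example: run locally with a custom annotator config file named my_annotator.yaml.
overrides = hydra_overrides({"local": "local", "annotator": "my_annotator"})
```

Here `overrides` would be passed as command-line arguments to whichever script starts the application.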

## Functionalities

- Annotation (UI/API/Script)
- File
- Package
- Project
- File
- Package
- Project
- Batch Analysis (Script Only)
- Temporal Analysis (**TODO**)
- Classification (**TODO**)
@@ -78,14 +96,15 @@ Other configuration can be defined by creating a new file in the folder of the s

## Development

### Add New Languages
### Add New Languages

In order to support more languages, a new language specific parser is needed.
In order to support more languages, a new language specific parser is needed.
We can create one quickly by using [tree-sitter](https://tree-sitter.github.io/tree-sitter/),
and a custom parser.

#### Parser
The parser needs to be in the [parser/languages](./src/parser/languages) folder.

The parser needs to be in the [parser/languages](./src/parser/languages) folder.
It has to extend the ```ParserBase``` class, which has the following interface.

```python
@@ -101,25 +120,28 @@ class ParserBase(ABC):
"""
...
```

The language-specific class has to contain the logic for parsing the language and extracting the identifiers.
For example, for Python the class looks like this:

```python
class PythonParser(ParserBase, lang=Extension.python.name):  # The lang argument is used to register the parser in the ParserFactory class.
class PythonParser(ParserBase,
                   lang=Extension.python.name):  # The lang argument is used to register the parser in the ParserFactory class.
    """
    Python specific parser. Uses a generic grammar for multiple versions of python. Uses tree_sitter to get the AST
    """

    def __init__(self, library_path: Path | str):
        super().__init__(library_path)
        self.language: Language = Language(library_path, Extension.python.name)  # Creates the tree-sitter language for python
        self.parser.set_language(self.language)  # Sets tree-sitter parser to parse the language

        self.language: Language = Language(library_path,
                                           Extension.python.name)  # Creates the tree-sitter language for python
        self.parser.set_language(self.language)  # Sets tree-sitter parser to parse the language

        # Pattern used to match the identifiers, it depends on the Language. Check tree-sitter
        self.identifiers_pattern: str = """
        ((identifier) @identifier)
        """

        # Creates the query used to find the identifiers in the AST produced by tree-sitter
        self.identifiers_query = self.language.query(self.identifiers_pattern)

@@ -128,14 +150,25 @@ class PythonParser(ParserBase, lang=Extension.python.name): # The lang argument
        self.keywords.update(['self', 'cls'])
```
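
The `lang=` class keyword in the example above registers the parser under a language name. One common way to implement this kind of registration in Python is `__init_subclass__`. The following self-contained sketch illustrates the idea under that assumption; `ParserRegistry`, `SketchParserBase`, `ToyParser`, and `get_identifiers` are hypothetical names and not AutoFL's actual classes (see `src/parser/parser.py` for the real implementation).

```python
from abc import ABC, abstractmethod
from typing import Optional


class ParserRegistry:
    """Hypothetical stand-in for a parser factory: language name -> parser class."""
    parsers: dict = {}


class SketchParserBase(ABC):
    def __init_subclass__(cls, lang: Optional[str] = None, **kwargs):
        # Called once for every subclass; `lang` comes from the class statement.
        super().__init_subclass__(**kwargs)
        if lang is not None:
            ParserRegistry.parsers[lang] = cls

    @abstractmethod
    def get_identifiers(self, code: str) -> list:
        """Return the identifiers found in `code`."""


class ToyParser(SketchParserBase, lang="toy"):
    def get_identifiers(self, code: str) -> list:
        # Naive whitespace tokenisation, for illustration only.
        return [token for token in code.split() if token.isidentifier()]


# Look up the parser class by language name and use it.
parser = ParserRegistry.parsers["toy"]()
identifiers = parser.get_identifiers("x = foo 1")
```

Defining a new subclass with a different `lang` value is enough to make it available through the registry, with no further wiring.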

A custom class that does not rely on [tree-sitter](https://github.com/tree-sitter/tree-sitter) can be also used, however, there are more methods from ParserBase that need to be
A custom class that does not rely on [tree-sitter](https://github.com/tree-sitter/tree-sitter) can also be used;
however, more methods from ParserBase need to be
changed. Check the implementation of [ParserBase](src/parser/parser.py).
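
As an illustration of identifier extraction without tree-sitter, the sketch below uses Python's standard `tokenize` module to collect identifiers from Python source while skipping keywords plus `self` and `cls`. The function `extract_identifiers` is a hypothetical standalone helper; it is not wired into ParserBase and does not show which ParserBase methods need to be overridden.

```python
import io
import keyword
import tokenize


def extract_identifiers(source: str,
                        extra_keywords: frozenset = frozenset({"self", "cls"})) -> list:
    """Return the identifiers in `source`, in order, excluding keywords (hypothetical helper)."""
    skip = set(keyword.kwlist) | set(extra_keywords)
    names = []
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover both keywords and identifiers, so filter keywords out.
        if token.type == tokenize.NAME and token.string not in skip:
            names.append(token.string)
    return names


sample = "class Foo:\n    def bar(self, x):\n        return x + 1\n"
sample_identifiers = extract_identifiers(sample)
```

The same approach carries over to other languages only if a comparable tokenizer is available, which is the gap tree-sitter fills.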

## Known Issues

- Installing the dependencies takes a while (~10 minutes) and might fail due to a timeout.
  Unfortunately, this issue is hard to reproduce, as it seems to be related to the network connection.
  If you encounter it, please try again. Future versions will try to fix it by cleaning up and reducing
  the number of dependencies.
- For some projects, the analysis might loop indefinitely. We are still investigating the cause of this issue.

## Disclaimer

The project is still in development, and it might not work as expected in some cases.
It has been developed and tested on Docker 24.0.7 for ```Ubuntu 22.04```. While minor testing has been done on ```Windows``` and ```MacOS```,
not all functionalities might work due to differences in Docker for these OSs (e.g. Windows uses WSL 2).
The project is offered as is; it is still in development and might not work as expected in some cases.
It has been developed and tested on Docker 24.0.7 and 25.0.0 for ```Ubuntu 22.04```. While minor testing has been done
on ```Windows``` and ```MacOS```, not all functionalities might work due to differences in Docker for these OSs (e.g.
Windows uses WSL 2).

In case of any problems, please open an issue, make a pull request, or contact me at ```c.a.sas@rug.nl```.

@@ -144,6 +177,7 @@ In case of any problems, please open an issue, make a pull request, or contact m
If you use this work please cite us:

### Paper

```text
@article{sas2024multigranular,
title = {Multi-granular Software Annotation using File-level Weak Labelling},
@@ -158,17 +192,20 @@ If you use this work please cite us:
}
```

**Note**: The code used in the paper is available in the [https://github.com/SasCezar/CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification) repository.
However, this tool is more up to date, is easier to use, configurable, and also offers a UI.
**Note**: The code used in the paper is available in
the [https://github.com/SasCezar/CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification)
repository.
However, this tool is more up to date, easier to use, more configurable, and also offers a UI.

### Tool

### Tool
```text
@software{sas2023autofl,
author = {Sas, Cezar and Capiluppi, Andrea},
month = dec,
title = {{AutoFL}},
url = {https://github.com/SasCezar/AutoFL},
version = {0.3.1},
version = {0.4.0},
year = {2023},
url = {https://doi.org/10.5281/zenodo.10255368},
doi = {10.5281/zenodo.10255368}
26 changes: 13 additions & 13 deletions pyproject.toml
@@ -1,33 +1,33 @@
[tool.poetry]
name = "autofl"
version = "0.3.1"
version = "0.4.0"
description = ""
authors = ["Cezar Sas <cezar.sas@gmail.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.10,<3.13"
fastapi = "^0.104.1"
fastapi = "^0.109.0"
gunicorn = "^21.2.0"
uvicorn = "^0.24.0"
pandas = "^2.1.3"
uvicorn = "^0.27.0"
pandas = "^2.2.0"
hydra-core = "^1.3.2"
setuptools = "^69.0.0"
multiset = "^3.0.1"
scikit-learn = "^1.3.2"
pydantic = "^2.5.2"
gitpython = "^3.1.40"
setuptools = "^69.0.3"
multiset = "^3.0.2"
scikit-learn = "^1.4.0"
pydantic = "^2.5.3"
gitpython = "^3.1.41"
loguru = "^0.7.1"
tqdm = "^4.66.1"
# yake = { git = "git@github.com:LIAAD/yake.git" }
python-rake = "^1.5.0"
more-itertools = "^10.1.0"
more-itertools = "^10.2.0"
tree-sitter = "^0.20.4"
sqlalchemy = "^2.0.21"
psycopg = {extras = ["binary"], version = "^3.1.14"}
sqlalchemy = "^2.0.25"
psycopg = {extras = ["binary"], version = "^3.1.17"}
gensim = "^4.3.2"
fasttext-wheel = "^0.9.2"
transformers = "^4.35.2"
transformers = "^4.37.1"
sentence-transformers = "^2.2.2"

[tool.poetry.group.dev.dependencies]
