Create paper projections for paper similarity graph (#63)
* Add papers.csv creator and UMAP projection in scripts/reduce.py

* Reformat code

* Add similar papers in poster page

* Add guide to produce similar paper recommendations

* Reformat main.py

* make image_path configurable

* format create_papers_csv

* Update with latest papers.csv and simplify code

* Remove unused imports

* Ignore typecheck for openreview and umap-learn

* Modify poster.html to get correct id field

* refactor templates/poster.html

* Update README.md

* Update README.recommendations.md

Co-authored-by: Hao Fang <haofang1990@gmail.com>
georgepar and hao-fang authored Jun 15, 2020
1 parent fb878cc commit e5a817e
Showing 7 changed files with 139 additions and 2 deletions.
7 changes: 7 additions & 0 deletions main.py
@@ -198,6 +198,13 @@ def poster(poster):
    uid = poster
    v = by_uid["papers"][uid]
    data = _data()

    data["openreview"] = format_paper(by_uid["papers"][uid])
    data["id"] = uid
    data["paper_recs"] = [
        format_paper(by_uid["papers"][n]) for n in site_data["paper_recs"][uid]
    ][1:]  # drop the first entry (presumably the paper itself)

    data["paper"] = format_paper(v)
    return render_template("poster.html", **data)

5 changes: 4 additions & 1 deletion scripts/README.md
@@ -1,5 +1,8 @@
This directory contains extensions to help support the mini-conf library.

For the updated procedure for generating similar-paper recommendations, see README.recommendations.md.

These include:

* `embeddings.py` : For turning abstracts into embeddings. Creates an `embeddings.torch` file.
@@ -17,7 +20,7 @@ python3 scripts/generate_version.py build/version.json
* `reduce.py` : For creating two-dimensional representations of the embeddings.

```bash
python embeddings.py ../sitedata/papers.csv embeddings.torch > ../sitedata/papers_projection.json
python reduce.py ../sitedata/papers.csv embeddings.torch > ../sitedata/papers_projection.json --projection-method umap
```

* `parse_calendar.py` : For converting a local or remote ICS file to JSON. For more on importing calendars, see [README_Schedule.md](README_Schedule.md).
35 changes: 35 additions & 0 deletions scripts/README.recommendations.md
@@ -0,0 +1,35 @@
# How to get similar paper recommendations

This guide describes how to produce similar-paper recommendations using the pretrained model
provided by the [ICLR webpage](https://github.com/ICLR/iclr.github.io/tree/master/recommendations) and abstract embeddings.

## Create a visualization based on BERT embeddings

1. Grab ACL2020
[papers.csv](https://github.com/acl-org/acl-2020-virtual-conference-sitedata/blob/add_acl2020_accepted_papers_tsv/papers.csv)
from this branch or a more recent version and copy it to `sitedata_acl2020`.
2. Run `python scripts/embeddings.py sitedata_acl2020/papers.csv` to produce the BERT embeddings
   for the paper abstracts (a quick sanity check on the resulting `embeddings.torch` file is sketched after this list).
3. Run `python scripts/reduce.py --projection-method [tsne|umap] sitedata_acl2020/papers.csv embeddings.torch > sitedata_acl2020/papers_projection.json`
to produce a 2D projection of the BERT embeddings for visualization. `--projection-method`
selects which dimensionality reduction technique to use.
4. Rerun `make run` and go to the paper visualization page.
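
Before running step 3, it can help to sanity-check the intermediate embeddings file from step 2. A minimal sketch, assuming (as `scripts/reduce.py` does) that `embeddings.torch` holds a 2-D float tensor with one row per paper:

```python
import torch

# Load the embeddings produced by scripts/embeddings.py and inspect them.
emb = torch.load("embeddings.torch")
print(emb.shape)        # e.g. (num_papers, 768) for BERT-base embeddings
print(emb.numpy()[:2])  # first two embedding vectors, as reduce.py would see them
```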


## Produce similar paper recommendations

1. Run `python scripts/create_recommendations_pickle.py --inp sitedata_acl2020/papers.csv --out cached_or.pkl` to produce `cached_or.pkl`.
This file is compatible with the inference scripts provided in [https://github.com/ICLR/iclr.github.io/tree/master/recommendations](https://github.com/ICLR/iclr.github.io/tree/master/recommendations)
2. Clone [https://github.com/ICLR/iclr.github.io](https://github.com/ICLR/iclr.github.io). You will
need `git-lfs` installed.
3. `cp cached_or.pkl iclr.github.io && cd iclr.github.io/recommendations`
4. Install any missing requirements.
5. Run `python recs.py`. This runs inference with the pretrained similarity model and produces the
`rec.pkl` file, which contains the paper similarities.
6. Use the `iclr.github.io/data/pkl_to_json.py` script to produce the `paper_recs.json` file, which
contains the similar-paper recommendations displayed on the website. Make sure to modify the
file paths so they point to the correct `cached_or.pkl` and `rec.pkl`.
7. Copy the produced `paper_recs.json` file to `sitedata_acl2020`. A version of this file produced
with this method is available [here](https://github.com/acl-org/acl-2020-virtual-conference-sitedata/blob/add_acl2020_accepted_papers_tsv/paper_recs.json); the expected format is sketched below.
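
For reference, a minimal sketch of how the site consumes the recommendations, assuming (as `main.py` in this commit does) that `paper_recs.json` maps each paper UID to a list of similar paper UIDs; the concrete IDs below are made up:

```python
import json

# Load the recommendations file copied into sitedata_acl2020.
with open("sitedata_acl2020/paper_recs.json") as fd:
    paper_recs = json.load(fd)

# Expected shape: {"<paper uid>": ["<paper uid>", ...], ...}.
# main.py drops the first entry of each list, which appears to be the paper itself.
recs = paper_recs["main.1"]  # "main.1" is a hypothetical UID
print(recs[1:6])             # the first few genuinely similar papers
```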
50 changes: 50 additions & 0 deletions scripts/create_recommendations_pickle.py
@@ -0,0 +1,50 @@
import argparse
import csv
import pickle

import openreview  # type: ignore

# No type hints for the openreview-py package. Ignore mypy


def read_entries(papers_csv):
    with open(papers_csv, "r") as fd:
        entries = list(csv.reader(fd, skipinitialspace=True))
        entries = entries[1:]  # skip header

    return entries


def dump_cached_or(entries, out_pickle):
    cached_or = {}
    for entry in entries:
        cached_or[entry[0]] = openreview.Note(  # id
            "", [], [], [], {"abstract": entry[3], "title": entry[1]}
        )  # Hack. The ICLR recommender script accepts Openreview notes

    with open(out_pickle, "wb") as fd:
        pickle.dump(cached_or, fd)


def parse_args():
    parser = argparse.ArgumentParser(
        description="Create a pickle of openreview Notes from papers.csv, "
        "compatible with the ICLR recommendation scripts"
    )
    parser.add_argument("--inp", type=str, help="input papers.csv")
    parser.add_argument(
        "--out",
        type=str,
        help="Dump entries into a pickle compatible with the "
        "ICLR recommendation engine",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    entries = read_entries(args.inp)
    dump_cached_or(entries, args.out)


if __name__ == "__main__":
    main()
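
For reference, a minimal sketch of the `papers.csv` layout that `create_recommendations_pickle.py` assumes; the column indices come from the script above, while the header names themselves are not shown in this commit:

```python
import csv

# Columns used by the script: row[0] -> paper UID, row[1] -> title, row[3] -> abstract.
with open("sitedata_acl2020/papers.csv", "r") as fd:
    rows = list(csv.reader(fd, skipinitialspace=True))[1:]  # skip the header row

for row in rows[:3]:
    print(row[0], "|", row[1][:40], "|", row[3][:60])
```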
15 changes: 14 additions & 1 deletion scripts/reduce.py
@@ -4,21 +4,34 @@

import sklearn.manifold
import torch
import umap  # type: ignore

# No type stubs for umap-learn. Ignore mypy


def parse_arguments():
    parser = argparse.ArgumentParser(description="MiniConf Portal Command Line")
    parser.add_argument("papers", default=False, help="paper file")

    parser.add_argument("embeddings", default=False, help="embeddings file to shrink")
    parser.add_argument("--projection-method", default="tsne", help="[umap|tsne]")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    emb = torch.load(args.embeddings)
    out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    if args.projection_method == "tsne":
        out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    elif args.projection_method == "umap":
        out = umap.UMAP(
            n_neighbors=5, min_dist=0.3, metric="correlation", n_components=2
        ).fit_transform(emb.numpy())
    else:
        print("invalid projection-method: {}".format(args.projection_method))
        print("Falling back to T-SNE")
        out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    d = []
    with open(args.papers, "r") as f:
        abstracts = list(csv.DictReader(f))
2 changes: 2 additions & 0 deletions scripts/requirements.txt
@@ -1,4 +1,6 @@
transformers
sklearn
umap-learn
openreview-py
torch==1.4.0
ics
27 changes: 27 additions & 0 deletions templates/poster.html
@@ -139,6 +139,33 @@ <h5 style="color: red;">
})
</script>

<div class="container" style="padding-bottom: 30px; padding-top:30px">
<center>
<h2> Similar Papers </h2>
</center>
</div>
<p></p>
<div class="container" >
<div class="row">
{% for recommended in paper_recs %}
<div class="col-md-4 col-xs-6">
<div class="pp-card" >
<div class="pp-card-header" class="text-muted">
<a href="poster_{{recommended.id}}.html" class="text-muted">
<h5 class="card-title" align="center">{{recommended.content.title}}</h5>
</a>
<h6 class="card-subtitle text-muted" align="center">
{% for a in recommended.content.authors %}
{{a}},
{% endfor %}
</h6>
<center><img class="cards_img" src="{{config.image_path}}/{{recommended.id}}.png" width="80%"/></center>
</div>
</div>
</div>
{% endfor %}
</div>
</div>



