Create paper projections for paper similarity graph (#63)
* Add papers.csv creator and UMAP projection in scripts/reduce.py

* Reformat code

* Add similar papers in poster page

* Add guide to produce similar paper recommendations

* Reformat main.py

* make image_path configurable

* format create_papers_csv

* Update with latest papers.csv and simplify code

* Remove unused imports

* Ignore typecheck for openreview and umap-learn

* Modify poster.html to get correct id field

* refactor templates/poster.html

* Update README.md

* Update README.recommendations.md

Co-authored-by: Hao Fang <haofang1990@gmail.com>
georgepar and hao-fang authored Jun 15, 2020
1 parent fb878cc commit e5a817e
Showing 7 changed files with 139 additions and 2 deletions.
7 changes: 7 additions & 0 deletions main.py
@@ -198,6 +198,13 @@ def poster(poster):
    uid = poster
    v = by_uid["papers"][uid]
    data = _data()

    data["openreview"] = format_paper(by_uid["papers"][uid])
    data["id"] = uid
    data["paper_recs"] = [
        format_paper(by_uid["papers"][n]) for n in site_data["paper_recs"][uid]
    ][1:]  # drop the first entry (presumably the paper itself)

    data["paper"] = format_paper(v)
    return render_template("poster.html", **data)

5 changes: 4 additions & 1 deletion scripts/README.md
@@ -1,5 +1,8 @@
This directory contains extensions to help support the mini-conf library.

For the updated procedure for generating similar-paper recommendations, see README.recommendations.md.

These include:

* `embeddings.py` : For turning abstracts into embeddings. Creates an `embeddings.torch` file.
@@ -17,7 +20,7 @@ python3 scripts/generate_version.py build/version.json
* `reduce.py` : For creating two-dimensional representations of the embeddings.

```bash
python embeddings.py ../sitedata/papers.csv embeddings.torch > ../sitedata/papers_projection.json
python reduce.py ../sitedata/papers.csv embeddings.torch > ../sitedata/papers_projection.json --projection-method umap
```

* `parse_calendar.py` : For converting a local or remote ICS file to JSON. For more on importing calendars, see [README_Schedule.md](README_Schedule.md).
35 changes: 35 additions & 0 deletions scripts/README.recommendations.md
@@ -0,0 +1,35 @@
# How to get similar paper recommendations

This guide describes how to produce similar-paper recommendations using the pretrained model
provided by the [ICLR webpage](https://github.com/ICLR/iclr.github.io/tree/master/recommendations) and abstract embeddings.

## Create a visualization based on BERT embeddings

1. Grab ACL2020
[papers.csv](https://github.com/acl-org/acl-2020-virtual-conference-sitedata/blob/add_acl2020_accepted_papers_tsv/papers.csv)
from this branch or a more recent version and copy it to `sitedata_acl2020`.
2. Run `python scripts/embeddings.py sitedata_acl2020/papers.csv` to produce the BERT embeddings
   for the paper abstracts (a quick sanity check on the resulting `embeddings.torch` file is sketched after this list).
3. Run `python scripts/reduce.py --projection-method [tsne|umap] sitedata_acl2020/papers.csv embeddings.torch > sitedata_acl2020/papers_projection.json`
to produce a 2D projection of the BERT embeddings for visualization. `--projection-method`
selects which dimensionality reduction technique to use.
4. Rerun `make run` and go to the paper visualization page.
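
Before running step 3, it can help to sanity-check the intermediate embeddings file from step 2. A minimal sketch, assuming (as `scripts/reduce.py` does) that `embeddings.torch` holds a 2-D float tensor with one row per paper:

```python
import torch

# Load the embeddings produced by scripts/embeddings.py and inspect them.
emb = torch.load("embeddings.torch")
print(emb.shape)        # e.g. (num_papers, 768) for BERT-base embeddings
print(emb.numpy()[:2])  # first two embedding vectors, as reduce.py would see them
```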


## Produce similar paper recommendations

1. Run `python scripts/create_recommendations_pickle.py --inp sitedata_acl2020/papers.csv --out cached_or.pkl` to produce `cached_or.pkl`.
This file is compatible with the inference scripts provided in [https://github.com/ICLR/iclr.github.io/tree/master/recommendations](https://github.com/ICLR/iclr.github.io/tree/master/recommendations)
2. Clone [https://github.com/ICLR/iclr.github.io](https://github.com/ICLR/iclr.github.io). You will
need `git-lfs` installed.
3. `cp cached_or.pkl iclr.github.io && cd iclr.github.io/recommendations`
4. Install any missing requirements.
5. Run `python recs.py`. This runs inference with the pretrained similarity model and produces the
`rec.pkl` file, which contains the paper similarities.
6. Use the `iclr.github.io/data/pkl_to_json.py` script to produce the `paper_recs.json` file, which
contains the similar-paper recommendations displayed on the website. Make sure to modify the
file paths so they point to the correct `cached_or.pkl` and `rec.pkl`.
7. Copy the produced `paper_recs.json` file to `sitedata_acl2020`. A version of this file produced
with this method is available [here](https://github.com/acl-org/acl-2020-virtual-conference-sitedata/blob/add_acl2020_accepted_papers_tsv/paper_recs.json); the expected format is sketched below.
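
For reference, a minimal sketch of how the site consumes the recommendations, assuming (as `main.py` in this commit does) that `paper_recs.json` maps each paper UID to a list of similar paper UIDs; the concrete IDs below are made up:

```python
import json

# Load the recommendations file copied into sitedata_acl2020.
with open("sitedata_acl2020/paper_recs.json") as fd:
    paper_recs = json.load(fd)

# Expected shape: {"<paper uid>": ["<paper uid>", ...], ...}.
# main.py drops the first entry of each list, which appears to be the paper itself.
recs = paper_recs["main.1"]  # "main.1" is a hypothetical UID
print(recs[1:6])             # the first few genuinely similar papers
```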
50 changes: 50 additions & 0 deletions scripts/create_recommendations_pickle.py
@@ -0,0 +1,50 @@
import argparse
import csv
import pickle

import openreview  # type: ignore

# No type hints for the openreview-py package. Ignore mypy


def read_entries(papers_csv):
    with open(papers_csv, "r") as fd:
        entries = list(csv.reader(fd, skipinitialspace=True))
        entries = entries[1:]  # skip header

    return entries


def dump_cached_or(entries, out_pickle):
    cached_or = {}
    for entry in entries:
        cached_or[entry[0]] = openreview.Note(  # id
            "", [], [], [], {"abstract": entry[3], "title": entry[1]}
        )  # Hack. The ICLR recommender script accepts Openreview notes

    with open(out_pickle, "wb") as fd:
        pickle.dump(cached_or, fd)


def parse_args():
    parser = argparse.ArgumentParser(
        description="Create a pickle of openreview Notes from papers.csv, "
        "compatible with the ICLR recommendation scripts"
    )
    parser.add_argument("--inp", type=str, help="input papers.csv")
    parser.add_argument(
        "--out",
        type=str,
        help="Dump entries into a pickle compatible with the "
        "ICLR recommendation engine",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    entries = read_entries(args.inp)
    dump_cached_or(entries, args.out)


if __name__ == "__main__":
    main()
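
For reference, a minimal sketch of the `papers.csv` layout that `create_recommendations_pickle.py` assumes; the column indices come from the script above, while the header names themselves are not shown in this commit:

```python
import csv

# Columns used by the script: row[0] -> paper UID, row[1] -> title, row[3] -> abstract.
with open("sitedata_acl2020/papers.csv", "r") as fd:
    rows = list(csv.reader(fd, skipinitialspace=True))[1:]  # skip the header row

for row in rows[:3]:
    print(row[0], "|", row[1][:40], "|", row[3][:60])
```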
15 changes: 14 additions & 1 deletion scripts/reduce.py
@@ -4,21 +4,34 @@

import sklearn.manifold
import torch
import umap  # type: ignore

# No type stubs for umap-learn. Ignore mypy


def parse_arguments():
    parser = argparse.ArgumentParser(description="MiniConf Portal Command Line")
    parser.add_argument("papers", default=False, help="paper file")

    parser.add_argument("embeddings", default=False, help="embeddings file to shrink")
    parser.add_argument("--projection-method", default="tsne", help="[umap|tsne]")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    emb = torch.load(args.embeddings)
    out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    if args.projection_method == "tsne":
        out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    elif args.projection_method == "umap":
        out = umap.UMAP(
            n_neighbors=5, min_dist=0.3, metric="correlation", n_components=2
        ).fit_transform(emb.numpy())
    else:
        print("invalid projection-method: {}".format(args.projection_method))
        print("Falling back to T-SNE")
        out = sklearn.manifold.TSNE(n_components=2).fit_transform(emb.numpy())
    d = []
    with open(args.papers, "r") as f:
        abstracts = list(csv.DictReader(f))
2 changes: 2 additions & 0 deletions scripts/requirements.txt
@@ -1,4 +1,6 @@
transformers
sklearn
umap-learn
openreview-py
torch==1.4.0
ics
27 changes: 27 additions & 0 deletions templates/poster.html
@@ -139,6 +139,33 @@ <h5 style="color: red;">
})
</script>

<div class="container" style="padding-bottom: 30px; padding-top:30px">
<center>
<h2> Similar Papers </h2>
</center>
</div>
<p></p>
<div class="container" >
<div class="row">
{% for recommended in paper_recs %}
<div class="col-md-4 col-xs-6">
<div class="pp-card" >
<div class="pp-card-header" class="text-muted">
<a href="poster_{{recommended.id}}.html" class="text-muted">
<h5 class="card-title" align="center">{{recommended.content.title}}</h5>
</a>
<h6 class="card-subtitle text-muted" align="center">
{% for a in recommended.content.authors %}
{{a}},
{% endfor %}
</h6>
<center><img class="cards_img" src="{{config.image_path}}/{{recommended.id}}.png" width="80%"/></center>
</div>
</div>
</div>
{% endfor %}
</div>
</div>



