fix: 🐛 fix all ImportError exceptions for the current datasets
Dataset scripts sometimes import extra libraries. We install all of them
(as of today).

Note that https://github.com/TREMA-UNH/trec-car-tools
could not be added through Poetry because the package lives in a subdirectory.
See python-poetry/poetry#755.
severo committed Jul 29, 2021
1 parent 4bfaf3e commit b3ec3ee
Showing 24 changed files with 4,524 additions and 98 deletions.
1,973 changes: 1,875 additions & 98 deletions poetry.lock


14 changes: 14 additions & 0 deletions pyproject.toml
@@ -9,6 +9,20 @@ python = "^3.8"
datasets = {extras = ["streaming"], version = "^1.10.2"}
starlette = "^0.16.0"
uvicorn = "^0.14.0"
Pillow = "^8.3.1"
trec-car-tools = {path = "vendors/trec-car-tools/python3"}
apache-beam = "^2.31.0"
conllu = "^4.4"
kss = "^2.5.1"
lm-dataformat = "^0.0.19"
lxml = "^4.6.3"
nlp = "^0.4.0"
openpyxl = "^3.0.7"
py7zr = "^0.16.1"
tensorflow = "^2.5.0"
transformers = "^4.9.1"
wget = "^3.2"
kenlm = {url = "https://github.com/kpu/kenlm/archive/master.zip"}

[tool.poetry.dev-dependencies]
black = "^21.7b0"
31 changes: 31 additions & 0 deletions vendors/trec-car-tools/.gitignore
@@ -0,0 +1,31 @@
# Maven template
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Sphinx documentation
python3/_build/
20 changes: 20 additions & 0 deletions vendors/trec-car-tools/.travis.yml
@@ -0,0 +1,20 @@
language: python
python:
- "3.5"
- "3.6"
- "3.7"
before_install:
- sudo apt-get -qq update
- sudo apt-get install -y maven
install:
- pip install -r python3/requirements.txt
script:
- pip install python3/
- pushd trec-car-tools-example; mvn install; popd

- curl http://trec-car.cs.unh.edu/datareleases/v2.0/test200.v2.0.tar.xz | tar -xJ
- pages=test200/test200-train/train.pages.cbor outlines=test200/test200-train/train.pages.cbor-outlines.cbor paragraphs=test200/test200-train/train.pages.cbor-paragraphs.cbor bash .travis/test.sh

- curl http://trec-car.cs.unh.edu/datareleases/v1.5/test200-v1.5.tar.xz | tar -xJ
- pages=test200/train.test200.cbor outlines=test200/train.test200.cbor paragraphs=test200/train.test200.cbor.paragraphs bash .travis/test.sh
13 changes: 13 additions & 0 deletions vendors/trec-car-tools/.travis/test.sh
@@ -0,0 +1,13 @@
#!/bin/bash

set -ex

python3/test.py pages $pages >/dev/null
python3/test.py outlines $outlines >/dev/null
python3/test.py paragraphs $paragraphs >/dev/null

cd trec-car-tools-example/
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="header ../$pages" >/dev/null
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="pages ../$pages" >/dev/null
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="outlines ../$outlines" >/dev/null
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="paragraphs ../$paragraphs" >/dev/null
29 changes: 29 additions & 0 deletions vendors/trec-car-tools/LICENSE
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) 2017, Laura Dietz and Ben Gamari
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
162 changes: 162 additions & 0 deletions vendors/trec-car-tools/README.mkd
@@ -0,0 +1,162 @@
# TREC Car Tools

[![Travis badge](https://travis-ci.org/TREMA-UNH/trec-car-tools.svg?branch=master)](https://travis-ci.org/TREMA-UNH/trec-car-tools)

Development tools for participants of the TREC Complex Answer Retrieval track.

Data release support for v1.5 and v2.0.

Note that, to let you compile your project against both trec-car format versions, the Maven artifact id was changed to `treccar-tools-v2` with version 2.0, and the package path changed to `treccar_v2`.


Current support for
- Python 3.6
- Java 1.8

If you are using [Anaconda](https://www.anaconda.com/), install the `cbor`
library for Python 3.6:
```
conda install -c laura-dietz cbor=1.0.0
```

## How to use the Python bindings for trec-car-tools?

1. Get the data from [http://trec-car.cs.unh.edu](http://trec-car.cs.unh.edu)
2. Clone this repository
3. `python setup.py install`

See `test.py` for an example of how to access the data.


## How to use the Java 1.8 (or higher) bindings for trec-car-tools through Maven?

Add the JitPack repository to your project's `pom.xml` file (or the equivalent for Gradle or sbt):

~~~~
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
~~~~

Add the trec-car-tools dependency:

~~~~
<dependency>
<groupId>com.github.TREMA-UNH</groupId>
<artifactId>trec-car-tools-java</artifactId>
<version>17</version>
</dependency>
~~~~

Compile your project with `mvn compile`.




## Tool support

This package provides support for the following activities.

- `read_data`: Reading the provided paragraph collection, outline collections, and training articles
- `format_runs`: Writing submission files


## Reading Data

If you use Python or Java, please use `trec-car-tools`; there is no need to understand the following. We provide Haskell bindings upon request. If you program in another language, you can use any CBOR library and decode the grammar below.

[CBOR](http://cbor.io) is similar to JSON, but it is a binary format that compresses better and avoids text-encoding issues.
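To make the compactness concrete, here is a toy, hand-rolled CBOR encoder for tiny string-keyed maps, compared against JSON. It is illustrative only (it handles just short keys and small unsigned ints); real code should use a CBOR library such as `cbor`.

```python
import json

def cbor_encode_small_map(d):
    # Toy CBOR encoder: handles only maps with < 24 entries, short text
    # keys (< 24 bytes), and small unsigned int values (< 24).
    out = bytearray([0xA0 | len(d)])      # map header: major type 5 + length
    for key, value in d.items():
        out.append(0x60 | len(key))       # text-string header: major type 3 + length
        out += key.encode("utf-8")
        out.append(value)                 # unsigned int < 24 fits in one byte
    return bytes(out)

doc = {"a": 1, "b": 2}
print(len(cbor_encode_small_map(doc)), len(json.dumps(doc).encode("utf-8")))  # → 7 16
```

Even on this tiny document, the binary encoding is less than half the size of the JSON text.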

Articles, outlines, and paragraphs are all described in CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through `ParaLink`s.


~~~~~
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
RedirectNames -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds -> [$pageId]
CategoryNames -> [$pageName]
CategoryIds -> [$pageId]
InlinkIds -> [$pageId]
InlinkAnchors -> [$anchorText]
PageSkeleton -> Section | Para | Image | ListItem
Section -> $sectionHeading [PageSkeleton]
Para -> Paragraph
Paragraph -> $paragraphId, [ParaBody]
ListItem -> $nestingLevel, Paragraph
Image -> $imageURL [PageSkeleton]
ParaBody -> ParaText | ParaLink
ParaText -> $text
ParaLink -> $targetPage $targetPageId $linkSection $anchorText
~~~~~
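To illustrate how the recursive `PageSkeleton` grammar nests, here is a small sketch that walks a skeleton-like tree built from plain Python tuples. The tuple shapes are hypothetical, chosen only to mirror the grammar above; they are not the library's actual types.

```python
# Hypothetical mirror of the PageSkeleton grammar:
#   ("section", heading, [children])   Section  -> $sectionHeading [PageSkeleton]
#   ("para", paragraph_id)             Para     -> Paragraph
#   ("list", nesting_level, para_id)   ListItem -> $nestingLevel, Paragraph
#   ("image", url, [children])         Image    -> $imageURL [PageSkeleton]

def count_paragraphs(skeleton):
    """Recursively count paragraph-bearing nodes in a PageSkeleton-like list."""
    total = 0
    for node in skeleton:
        kind = node[0]
        if kind == "para":
            total += 1
        elif kind == "list":
            total += 1                    # a ListItem wraps one Paragraph
        elif kind in ("section", "image"):
            total += count_paragraphs(node[2])
    return total

page = [
    ("para", "p1"),
    ("section", "Habitat", [
        ("para", "p2"),
        ("list", 1, "p3"),
        ("section", "Pelagic zone", [("para", "p4")]),
    ]),
]
print(count_paragraphs(page))  # → 4
```

The same recursive walk applies to the real decoded objects: sections contain child skeleton nodes, and paragraphs sit at the leaves.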

You can use any CBOR serialization library. Below is a convenience library for reading the data into Python (3.5):

- `./read_data/trec_car_read_data.py`:
  a Python 3.5 convenience library for reading the input data (in CBOR format).
  - If you use Anaconda, install the cbor library with `conda install -c auto cbor=1.0`
  - Otherwise, install it with `pip install cbor`

## Ranking Results

Given an outline, your task is to produce one ranking for each section `$section` (each section representing an information need in the traditional IR sense).

Each ranked element is an (entity, passage) pair, meaning that the passage is relevant for the section because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.

The section is represented by the path of headings in the outline `$pageTitle/$heading1/$heading1.1/.../$section` in URL encoding.

The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.

The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.
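As a sketch of how such a section identifier could be built (the function name and inputs are illustrative, not part of the toolkit):

```python
from urllib.parse import quote

def section_path_id(page_title, headings):
    # URL-encode each component, then join with "/" as in
    # $pageTitle/$heading1/$heading1.1/.../$section
    return "/".join(quote(part, safe="") for part in [page_title, *headings])

print(section_path_id("Green sea turtle", ["Habitat", "Pelagic zone"]))
# → Green%20sea%20turtle/Habitat/Pelagic%20zone
```

Passing `safe=""` ensures any `/` inside a heading is also percent-encoded, so the joined path stays unambiguous.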


The results are provided in a format that is similar to the "trec\_results file format" of [trec_eval](http://trec.nist.gov/trec_eval). More info on how to use [trec_eval](http://stackoverflow.com/questions/4275825/how-to-evaluate-a-search-retrieval-engine-using-trec-eval) and [source](https://github.com/usnistgov/trec_eval).

Example of the ranking format:
~~~~~
Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam
$qid $entity $passageId rank sim run_id
~~~~~
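A minimal helper that emits one line of this six-column format (the helper name is illustrative; it also does not handle the optional omission of the entity or passage described above):

```python
def run_line(qid, entity, passage_id, rank, sim, run_id):
    # Column order follows the format above: $qid $entity $passageId rank sim run_id
    return f"{qid} {entity} {passage_id} {rank} {sim} {run_id}"

line = run_line("Green_sea_turtle/Habitat", "Pelagic_zone", "12345", 0, 27409, "myTeam")
print(line)
# → Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam
```

In practice you would write one such line per ranked (entity, passage) pair, in rank order, to the run file.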



## Integration with other tools

It is recommended to use the `format_runs` package to write run files. Here is an example:


    with open('runfile', mode='w', encoding='UTF-8') as f:
        writer = configure_csv_writer(f)
        for page in pages:
            for section_path in page.flat_headings_list():
                ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p)
                           for p, s, r in ranking]
                format_run(writer, ranking, exp_name='test')

This ensures that the output is correctly formatted to work with `trec_eval` and the provided qrels file.

Run [trec_eval](https://github.com/usnistgov/trec_eval/blob/master/README) version 9.0.4 as usual:

trec_eval -q release.qrel runfile > run.eval

The output is compatible with the eval plotting package [minir-plots](https://github.com/laura-dietz/minir-plots). For example, run:

python column.py --out column-plot.pdf --metric map run.eval
python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval

Moreover, you can compute success statistics, such as hurts/helps or a paired t-test, as follows:

python hurtshelps.py --metric map run.eval run2.eval
python paired-ttest.py --metric map run.eval run2.eval




<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/3.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Dataset" property="dct:title" rel="dct:type">TREC-CAR Dataset</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="trec-car.cs.unh.edu" property="cc:attributionName" rel="cc:attributionURL">Laura Dietz, Ben Gamari</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="www.wikipedia.org" rel="dct:source">www.wikipedia.org</a>.