qbreader/packet-parser

A complete program to automatically parse quizbowl packets, such as from quizbowlpackets.com, for use in the qbreader/website.

Can also be used to parse individual packets.
Includes a question category/subcategory classifier.
Designed to work well with a variety of packet formats - can parse packets "as-is".

WARNING: Although the program can parse pdf files, results may vary - pdf parsing is notoriously inconsistent.

How to Use

Make sure you have python3 and pip installed on your computer.

1. Clone the repository and cd into the folder.
2. Install the necessary Python libraries (e.g. pdf2docx, regex, python-docx) with pip install -r requirements.txt.
3. Download the packets, either manually or using download-set.sh.
- If the packets are .docx files, place them in a folder called p-docx.
- If the packets are .pdf files, place them in a folder called p-pdf.
- If the packets are .txt files, place them in a folder called packets.
- If you are using a Mac, you can use MacroRecorder along with the provided macro file doc-to-docx.mrf to automatically convert .doc files to .docx files.
  - Note: There are known issues with this macro; I've only gotten it to work on 16" MacBook Pros.
4. If the packets are .docx or .pdf files, run the to-txt.sh file to convert them to .txt files.
5. Run the packet-parser.py Python file. Specify the -m flag if you want to output in a format compatible with MODAQ. Specify the -b flag if you want to output in a format compatible with buzzpoints.
- The script will ask whether the packets have category tags. To check, look in the packets for tags like one of the following (if unsure, reply with "n"):
  - <Science - Biology>
  - <Biology>
  - <Ed. Wu - Biology>
  - <GW - Science, Biology>
- If any errors appear during the text -> json step, delete the output/ folder, fix any mistakes in packets/, and rerun packet-parser.py.
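
If you are unsure whether a packet has category tags, a quick scan of the .txt file can help. The sketch below is not part of this repository: the pattern and the function name find_category_tags are assumptions, and it only looks for lines that consist entirely of a <...> tag like the examples above.

```python
import re

# Hypothetical pattern: a line made up only of "<...>", such as
# "<Science - Biology>" or "<GW - Science, Biology>".
TAG_LINE = re.compile(r"^[ \t]*<[^<>\n]+>[ \t]*$", re.MULTILINE)


def find_category_tags(text):
    """Return every line in `text` that looks like a category tag."""
    return [match.group(0).strip() for match in TAG_LINE.finditer(text)]


sample = (
    "1. This organelle is the site of cellular respiration.\n"
    "ANSWER: mitochondria\n"
    "<Science - Biology>\n"
)
print(find_category_tags(sample))
```

If the list comes back empty for several packets in the set, replying "n" at the prompt is probably the safer choice.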

Command Line Options

You can get more info by running python packet-parser.py --help. Here are some common flags/options:

-b, --buzzpoint: Output in a format compatible with buzzpoints.
-m, --modaq: Output in a format compatible with MODAQ.
-p, --auto-insert-powermarks: Insert powermarks for questions that are bolded in power but do not have an explicit powermark. Most useful for old Chicago Open packets.
-s, --space-powermarks: Ensure powermarks (*) are surrounded by spaces. MODAQ and qbreader both expect this for powers to correctly register.
-e, -l, --bonus-length INTEGER [default: 3]: The number of parts in a bonus. Useful when you don't have 3-part bonuses (e.g. MUSES).
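
To illustrate what -s, --space-powermarks is described as doing, here is a minimal sketch that pads each (*) with exactly one space on either side. It is an assumption about the behavior, not the script's actual code, and the name space_powermarks is hypothetical.

```python
import re

# Match a powermark together with any spaces or tabs already around it.
POWERMARK = re.compile(r"[ \t]*\(\*\)[ \t]*")


def space_powermarks(text):
    """Surround every (*) with a single space on each side."""
    return POWERMARK.sub(" (*) ", text)


print(space_powermarks("early clues(*)later clues"))
```

A powermark at the very end of a line would pick up a trailing space under this sketch, so a real implementation may want to trim line endings afterwards.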

Errors

When running the packet parser, you may run into warnings and errors. These are usually caused by formatting problems in the packets. Common ones include:

WARNING: tossup {question #} answer may contain the next question
WARNING: bonus {question #} leadin may contain the previous question
- These two warnings likely mean that you're missing a question number. Try adding a 1. in front of the next question.
ERROR: bonus {question #} has fewer than {EXPECTED_BONUS_LENGTH} parts
- This likely means a [10] is missing somewhere or is mistyped (such as [10[).
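
Mistyped part markers can be hard to spot by eye. The following sketch lists lines whose leading marker looks like a malformed [10]. It is a heuristic written for illustration, not something the parser provides: it assumes a well-formed part begins with exactly [10], and the names VALID, SUSPECT and find_mistyped_markers are hypothetical.

```python
import re

# Assumption: a well-formed bonus part starts with exactly "[10]".
VALID = re.compile(r"^\[10\]")
# Hypothetical heuristic: "10" wrapped in some mix of bracket characters.
SUSPECT = re.compile(r"^[\[\](){}]\s*10\s*[\[\](){}]")


def find_mistyped_markers(text):
    """Return (line number, line) pairs whose part marker looks malformed."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip()
        if SUSPECT.match(stripped) and not VALID.match(stripped):
            hits.append((number, line))
    return hits


sample = (
    "[10] Name this author.\n"
    "ANSWER: Jane Austen\n"
    "[10[ Name this novel.\n"
    "ANSWER: Emma\n"
)
print(find_mistyped_markers(sample))
```

Parts with no marker at all are not detected here; the regexes in the Preprocessing section are the better tool for that case.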

Preprocessing

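If the bonus parts don't have [10] in front of them, try adding it by matching with one of the two regexes below. Each is a zero-width lookahead that, in multiline mode, matches at the start of a line immediately followed by an ANSWER: line; the first skips lines that begin with "(" and the second skips lines that begin with a digit: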

(?=^[^(].*\nANSWER:)
(?=^[^0-9].*\nANSWER:)
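
As a rough illustration of how the second regex can be applied, the sketch below inserts "[10] " at every match. It uses Python's built-in re module with re.MULTILINE rather than an editor's find-and-replace, and the function name add_bonus_markers is an assumption made for this example.

```python
import re

# The second regex from above; MULTILINE makes ^ match at each line start.
PART_START = re.compile(r"(?=^[^0-9].*\nANSWER:)", re.MULTILINE)


def add_bonus_markers(text, marker="[10] "):
    """Insert `marker` before each line that is directly followed by ANSWER:."""
    return PART_START.sub(marker, text)


sample = (
    "1. For 10 points each, name these authors:\n"
    "He wrote Hamlet.\n"
    "ANSWER: William Shakespeare\n"
    "She wrote Emma.\n"
    "ANSWER: Jane Austen\n"
)
print(add_bonus_markers(sample))
```

Two caveats with this sketch: an unnumbered tossup whose text sits directly above its ANSWER: line would also match, and a blank line right before a bonus part can match as well (the character class also accepts a newline), so it is worth reviewing the result before running the parser.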

Postprocessing Packet Names

UPDATE: I now recommend using the Batch Rename extension for VSCode and using multiline editing to rename the files.

Although most modern file explorers (including VS Code) are smart enough to sort packet names by their numeric part, the program that uploads the packets is not. Instead, it orders them lexically, like so:

Packet 1.json
Packet 10.json
Packet 11.json
Packet 12.json
Packet 2.json
Packet 3.json
Packet 4.json
Packet 5.json
Packet 6.json
Packet 7.json
Packet 8.json
Packet 9.json

It's a good idea to add a leading 0 to all the single-digit names so that they are ordered correctly. It's also a good idea to remove the "Packet" part of the names and any other redundant info (such as set names). Generally speaking, this includes phrases that appear in every packet name, and does NOT include the list of schools that wrote the packet (which is commonly the case for ACF packets). The final result will look like this:

01.json
02.json
03.json
04.json
05.json
06.json
07.json
08.json
09.json
10.json
11.json
12.json

Remove first 7 characters from each file name:

cd output
for f in *; do mv "$f" "${f:7}"; done

Rename files from x.json to 0x.json:¹

cd output
for f in *; do if [ ${#f} = 6 ] ; then mv "$f" "0${f}"; fi; done
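
The same two renaming steps can also be sketched in Python, which avoids hard-coding the character counts. This is an illustrative alternative and not a script shipped with the repository; the names normalized_name and rename_packets, and the default prefix "Packet ", are assumptions.

```python
from pathlib import Path


def normalized_name(name, prefix="Packet ", width=2):
    """Drop a shared prefix and zero-pad a numeric stem: 'Packet 1.json' -> '01.json'."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    path = Path(name)
    # Only pad stems that are purely digits; leave names like 'Finals' alone.
    stem = path.stem.zfill(width) if path.stem.isdigit() else path.stem
    return stem + path.suffix


def rename_packets(folder, prefix="Packet ", width=2):
    """Rename every file in `folder` using normalized_name, skipping collisions."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(normalized_name(path.name, prefix, width))
        if target != path and not target.exists():
            path.rename(target)


print(sorted(normalized_name(f"Packet {n}.json") for n in (1, 2, 10, 11)))
```

Calling rename_packets("output") would then apply the renaming to the output folder; lexical order and numeric order agree once every name has the same number of digits.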

Classifier

This repository includes a classifier located in the classifier/ directory, which is a Naive Bayes classifier that uses additive smoothing controlled by $\epsilon$, the smoothing parameter. The default value of $\epsilon$ is $0.01$.
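
To make the role of $\epsilon$ concrete, here is a minimal sketch of a Naive Bayes text classifier with additive smoothing. It is not the code in classifier/: it assumes a multinomial word-count model with whitespace tokenization, and the class name NaiveBayes and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self, epsilon=0.01):
        self.epsilon = epsilon
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of questions
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log prior of the category.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                # Additive smoothing: an unseen word contributes epsilon, not zero.
                numerator = counts[word] + self.epsilon
                denominator = total_words + self.epsilon * vocab_size
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    ["the cell membrane protein", "the treaty war king", "enzyme protein cell"],
    ["Science", "History", "Science"],
)
print(model.predict("protein in the cell"))
```

Without smoothing, a single word never seen in a category would drive that category's probability to zero; a small $\epsilon$ keeps the penalty large but finite.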

Performance

Methodology: The data was shuffled using numpy with a set seed of 0, and then split into an 80/20 train/test split. Below are the accuracy and time² on the 20% test set:

Naive Bayes accuracy / time:  85.35% (46053/53955) / 19.45 seconds (0.360 milliseconds per question)
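
The percentage and per-question time follow directly from the raw counts reported above. The short check below just redoes that arithmetic; the variable names are chosen for this example.

```python
correct, total = 46053, 53955   # correctly classified / test questions
elapsed_seconds = 19.45         # time to classify the whole test set

accuracy_percent = 100 * correct / total
ms_per_question = 1000 * elapsed_seconds / total

print(f"{accuracy_percent:.2f}% accuracy, {ms_per_question:.3f} ms per question")
```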

QuizDB

UPDATE: As of November 28th, 2022, QuizDB has been shut down. nocard.org is the closest replacement, but it does not support exporting questions to text, csv, or json.

The quizdb folder contains the files needed to convert questions from the QuizDB JSON format to the QB Reader format. This approach is not recommended: most questions on QuizDB are not particularly well formatted, and it may put a heavy load on the QuizDB server.

1. Make a QuizDB query by selecting a tournament, clearing all other fields, and pressing search.
2. Click the JSON button and move the downloaded file to the quizdb folder.
3. Run quizdb-process.py.
4. Run change_cat_names.py.

Background:

I needed a way to automatically download and parse packets for QB Reader. I wrote this program after running into issues with formatting requirements and lack of category support when using YAPP. YAPP is awesome and powers an awesome moderation tool, MODAQ.

Footnotes

¹ The number 6 comes from the fact that x.json is 6 characters long. Modify as needed for other extensions and use cases.
² The amount of time it took to classify all of the test samples.
