Multilingual Coref

@AngledLuffa released this 12 Sep 23:17

multilingual coref!

  • Added models which each cover several languages: one for the combined Germanic and Romance languages, and one for the Slavic languages available in UDCoref #1406

new features

  • streamlit visualizer for semgrex/ssurgeon #1396
  • updates to the constituency parser ensemble #1387
  • accuracy improvements to the IN_ORDER oracle #1391
  • Split-only MWT model, which cannot hallucinate text the way the existing MWT model sometimes does for OOV words. Currently available for EN and HE #1417 #1419
  • download_method=None now turns off HF downloads as well, for use on machines with no internet access #1408 #1399
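
As a rough illustration of the `download_method=None` item above, an offline pipeline might be built as sketched below. The helper names `offline_pipeline_kwargs` and `build_offline_pipeline` are hypothetical and only for illustration; the processor list is an arbitrary example.

```python
def offline_pipeline_kwargs(lang="en", processors="tokenize,mwt"):
    """Collect Pipeline arguments for a machine without internet access.

    Hypothetical helper: only the download_method=None setting comes
    from the release note; the rest is an illustrative default.
    """
    return {
        "lang": lang,
        "processors": processors,
        # None asks the pipeline not to download anything
        "download_method": None,
    }


def build_offline_pipeline(lang="en", processors="tokenize,mwt"):
    """Build a pipeline from models that are already on disk."""
    # Imported lazily so the kwargs helper can be used without stanza installed
    import stanza

    return stanza.Pipeline(**offline_pipeline_kwargs(lang, processors))
```

With downloads disabled, the models for the requested language and processors need to be present locally already, for example copied over from a machine where they were downloaded earlier.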

new models

  • Spanish combined models #1395
  • Add IACLT knesset to the HE combined models
  • NER based on IACLT
  • XCL (Classical Armenian) models with word vectors from Caval

bugfixes

  • update tqdm usage to remove some duplicate code: #1413 3de69ca
  • long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
  • Occasionally train the tokenizer with the sentence-final punctuation of a batch removed. This keeps the tokenizer from learning to always split off the last character, whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
  • actually include the visualization: #1421 thank you @bollwyvl
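
The idea behind the punctuation-removal fix above can be sketched as a small augmentation step. This is a simplified, token-level illustration and not the actual training code: the function name `drop_final_punct`, the `prob` parameter, and the list-of-token-lists batch format are all assumptions made for the example.

```python
import random
import unicodedata


def drop_final_punct(batch, rng, prob=0.1):
    """With probability `prob`, remove trailing punctuation from the last sentence.

    batch: list of sentences, each a list of token strings (assumed format).
    Returns a new list when a change is made, otherwise the batch itself.
    """
    if not batch or rng.random() >= prob:
        return batch
    last = batch[-1]
    # Keep at least one token, and only drop a token made purely of punctuation
    # (Unicode general categories starting with "P")
    if len(last) > 1 and all(
        unicodedata.category(ch).startswith("P") for ch in last[-1]
    ):
        return batch[:-1] + [last[:-1]]
    return batch


rng = random.Random(0)
batch = [["Hola", "."], ["Adiós", "!"]]
augmented = drop_final_punct(batch, rng, prob=1.0)
```

Seeing some batches that end without punctuation gives the model examples where the final character belongs to a word, so it has less reason to treat "last character" as a signal on its own.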