[SPARKNLP-1031] Solves Dependency Parsers training issue #14225

@danilojsl (Contributor) commented Apr 3, 2024

Description

This PR improves the processing of the CoNLL-U format, which the training of Dependency Parsers relies on. The key improvements are:

  • Enhanced Multiword Token Handling: Lines whose id column marks a multiword token (e.g., 2-3 no _ _ _ _ _ _ _ _) are now processed properly, so multiword tokens are recognized and handled throughout parsing.

  • Improved Handling of Missing uPos Values: Previously, lines with unavailable uPos values could disrupt parsing. Such lines are now handled gracefully, so parsing continues even when uPos values are absent.
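
To illustrate the two cases above, here is a minimal Python sketch of a CoNLL-U reader. It is not the code from this PR (Spark NLP's implementation is in Scala and may behave differently). The names `Word`, `is_multiword_id`, `value_or_none`, and `parse_sentence` are hypothetical, and the choice to collect multiword range lines separately from the syntactic words is an assumption made for the example. Empty nodes (ids such as 8.1) are not handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MISSING = "_"  # CoNLL-U placeholder for an unspecified value


@dataclass
class Word:
    index: int
    form: str
    upos: Optional[str]    # None when the uPos column is unavailable
    head: Optional[int]
    deprel: Optional[str]


def is_multiword_id(token_id: str) -> bool:
    """Return True for range ids such as "2-3"."""
    start, sep, end = token_id.partition("-")
    return sep == "-" and start.isdigit() and end.isdigit()


def value_or_none(column: str) -> Optional[str]:
    """Map the "_" placeholder (or an empty column) to None."""
    column = column.strip()
    return None if column in ("", MISSING) else column


def parse_sentence(lines: List[str]) -> Tuple[List[Word], List[Tuple[str, str]]]:
    """Split one sentence into syntactic words and multiword token ranges."""
    words: List[Word] = []
    multiword_tokens: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) != 10:
            raise ValueError(f"expected 10 tab-separated columns: {line!r}")
        token_id, form = columns[0], columns[1]
        if is_multiword_id(token_id):
            # Assumption: range lines carry no head/deprel, so keep them
            # apart from the words that form the dependency tree.
            multiword_tokens.append((token_id, form))
            continue
        head = value_or_none(columns[6])
        words.append(
            Word(
                index=int(token_id),
                form=form,
                upos=value_or_none(columns[3]),
                head=int(head) if head is not None else None,
                deprel=value_or_none(columns[7]),
            )
        )
    return words, multiword_tokens


# Illustrative sentence: a "2-3" range line and a word with no uPos value.
sample = [
    "# text = Moro no Brasil",
    "1\tMoro\tmorar\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tno\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tem\tem\tADP\t_\t_\t4\tcase\t_\t_",
    "3\to\to\t_\t_\t_\t4\tdet\t_\t_",
    "4\tBrasil\tBrasil\tPROPN\t_\t_\t1\tobl\t_\t_",
]

words, multiword_tokens = parse_sentence(sample)
```

In this sketch the range line ends up in `multiword_tokens` rather than in `words`, and the word with the `_` uPos column gets `upos=None` instead of raising an error.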

Beyond these functional changes, this PR refactors the underlying code to improve readability and maintainability, which should make future modifications and debugging easier.

Motivation and Context

Solves issue #14214

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab notebook

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl added the bug-fix and DON'T MERGE (Do not merge this PR) labels Apr 3, 2024
@maziyarpanahi changed the base branch from master to release/533-release-candidate April 5, 2024 15:40
@maziyarpanahi merged commit 75dbfcc into release/533-release-candidate Apr 5, 2024
6 checks passed