Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publication Date Parsing #14

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Publication Date Parsing #14

wants to merge 6 commits into from

Conversation

RustySheep
Copy link

Description

This pull request adds a new publication_dt field to the Publication model in the webmonitor app. The publication_dt field is a DateField that represents the date the publication was published.

In order to populate the publication_dt field, the datefinder package was added as a requirement, and a new function was implemented to parse the publication date from the existing publication_date field.

Additionally, a new migration was prepared to add the publication_dt field to the database schema. A new clean_publication_dt method was added to the ScrapedPublicationForm to validate the publication_dt field. Finally, the publication_dt field was registered with the PublicationSerializer to ensure that it is included in the REST API.

Changes Made

  1. Added datefinder to the list of requirements in requirements.txt.
  2. Implemented a new function parse_date in utils/date.py to parse the publication date from the publication_date field.
  3. Added a new publication_dt field to the Publication model in models/publication.py.
  4. Prepared a new migration to add the publication_dt field to the database schema.
  5. Added a new clean_publication_dt method to the ScrapedPublicationForm in forms/publication.py to validate the publication_dt field.
  6. Registered the publication_dt field with the PublicationSerializer in api/serializers/publication.py to ensure that it is included in the REST API. Also registered at PublicationAdmin in admin/publication.py and PublicationDocument in documents/publication.py

Testing

I tested these changes on my local development environment and verified that the publication_dt field is added to the Publication model, the clean_publication_dt method validates the publication_dt field correctly, and the publication_dt field is included in the REST API.

I also verified that the new migration can be applied successfully using the python manage.py migrate command.

@splitbrain splitbrain requested a review from o-zander May 26, 2023 11:01
@splitbrain splitbrain changed the title Swp rs 1 Publication Date Parsing May 30, 2023
@o-zander
Copy link
Contributor

Hi @RustySheep,

thanks for your well prepared submission.

I had a look at your implementation and will add line comments later. But first I would discuss this topic on a more general level.

I already had my hands on this topic and played around with the datefinder library a while ago. I don't know in which extent you had access to the real data from the production DB (I had) but mit findings where that this library and approach isn't really suitable for our data. In many cases we have incomplete "date data". The publication field data contains texts like "May 2022" or "5 months ago", which can't be parsed to "complete" dates. There are two solutions for this problem in my opinion:

  1. Fill the missing information with ones, so "May 2022" is going to be 2022-01-01 (more or likely what you approach is) but loosing the information, that it is not really the first of may.
  2. Do not work with dates in the first place but use three separate fields for year, month and day.

I prefer the second solution. But this depends on the needs of the application.

I wrote a date parsing library on my own which is tailored to the data I found in the live database (as of June 22). You can have a look at it in MR #15. First I use the tools django has built in which use very specific regex patterns and then try to match some own regex patterns from really specific ones to more fractional ones. With this approach I was able to match every string in the database - even "5 months ago", which needs to be resolved to a date by using the created date in a second step.

The decision how to handle the data really depends on the requirement and what the data should be used for. Have you run your code against live data?

@@ -64,6 +66,11 @@ def clean_authors(self) -> Sequence[str]:

return [clean(author) for author in items]

def clean_publication_dt(self) -> datetime.date:
value = self.truncated_field('publication_date')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary to use a truncated value here. You can directly use self.cleaned_data.

@@ -64,6 +66,11 @@ def clean_authors(self) -> Sequence[str]:

return [clean(author) for author in items]

def clean_publication_dt(self) -> datetime.date:
value = self.truncated_field('publication_date')
default = f'{datetime.date.today().year}-01-01'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When working with dates and datetimes in django consider using the timezone aware utils from django.utils.timezone.

name='publication_dt',
field=models.DateField(blank=True, null=True, verbose_name='publication dt'),
),
migrations.RunPython(code=migrate_date),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a reverse migration. In this case a migrations.RunPython.noop is sufficient. But with out it, this migration cannot be reverted.


for publication in Publication.objects.all():
default = publication.created.date().isoformat()
publication.publication_dt = parse_date(publication.publication_date, default_date_str=default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my initial discussion about losing the information whether a publication date string could be (fully parsed): your field is nullable, maybe keep it null, when no date could be parsed instead of filling it with a misleading date derived from the created date.

@@ -42,6 +42,7 @@ class Publication(models.Model):
abstract = models.TextField(_('abstract'), blank=True) # [AB]
authors = ArrayField(models.CharField(max_length=255), blank=True, null=True, verbose_name=_('authors')) # [AU]
publication_date = models.CharField(_('publication date'), max_length=255, blank=True, default='') # [PY]
publication_dt = models.DateField(_('publication dt'), null=True, blank=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming scheme is a bit confusing. It would be better to make clear which field contains which data just by the name. I would recommend:

  • renaming the old publication_date CharField to publication_date_raw
  • publication_date for your real DateField

import datefinder


def parse_date(date_str: str, default_date_str: str='1700-01-01') -> dt.date:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to pass the default as a string when every current usage of this method has a real date to hand? Why the default 1700-01-01?

I would remove the default all together and return None when no date could be parsed. The calling code can decide whether it uses a default or does some other stuff then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants