-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Publication Date Parsing #14
base: master
Are you sure you want to change the base?
Conversation
Hi @RustySheep, thanks for your well prepared submission. I had a look at your implementation and will add line comments later. But first I would discuss this topic on a more general level. I already had my hands on this topic and played around with the datefinder library a while ago. I don't know in which extent you had access to the real data from the production DB (I had) but mit findings where that this library and approach isn't really suitable for our data. In many cases we have incomplete "date data". The publication field data contains texts like "May 2022" or "5 months ago", which can't be parsed to "complete" dates. There are two solutions for this problem in my opinion:
I prefer the second solution. But this depends on the needs of the application. I wrote a date parsing library on my own which is tailored to the data I found in the live database (as of June 22). You can have a look at it in MR #15. First I use the tools django has built in which use very specific regex patterns and then try to match some own regex patterns from really specific ones to more fractional ones. With this approach I was able to match every string in the database - even "5 months ago", which needs to be resolved to a date by using the created date in a second step. The decision how to handle the data really depends on the requirement and what the data should be used for. Have you run your code against live data? |
@@ -64,6 +66,11 @@ def clean_authors(self) -> Sequence[str]: | |||
|
|||
return [clean(author) for author in items] | |||
|
|||
def clean_publication_dt(self) -> datetime.date: | |||
value = self.truncated_field('publication_date') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary to use a truncated value here. You can directly use self.cleaned_data
.
@@ -64,6 +66,11 @@ def clean_authors(self) -> Sequence[str]: | |||
|
|||
return [clean(author) for author in items] | |||
|
|||
def clean_publication_dt(self) -> datetime.date: | |||
value = self.truncated_field('publication_date') | |||
default = f'{datetime.date.today().year}-01-01' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When working with dates and datetimes in django consider using the timezone aware utils from django.utils.timezone
.
name='publication_dt', | ||
field=models.DateField(blank=True, null=True, verbose_name='publication dt'), | ||
), | ||
migrations.RunPython(code=migrate_date), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a reverse migration. In this case a migrations.RunPython.noop
is sufficient. But with out it, this migration cannot be reverted.
|
||
for publication in Publication.objects.all(): | ||
default = publication.created.date().isoformat() | ||
publication.publication_dt = parse_date(publication.publication_date, default_date_str=default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following my initial discussion about losing the information whether a publication date string could be (fully parsed): your field is nullable, maybe keep it null, when no date could be parsed instead of filling it with a misleading date derived from the created date.
@@ -42,6 +42,7 @@ class Publication(models.Model): | |||
abstract = models.TextField(_('abstract'), blank=True) # [AB] | |||
authors = ArrayField(models.CharField(max_length=255), blank=True, null=True, verbose_name=_('authors')) # [AU] | |||
publication_date = models.CharField(_('publication date'), max_length=255, blank=True, default='') # [PY] | |||
publication_dt = models.DateField(_('publication dt'), null=True, blank=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming scheme is a bit confusing. It would be better to make clear which field contains which data just by the name. I would recommend:
- renaming the old
publication_date
CharField topublication_date_raw
publication_date
for your real DateField
import datefinder | ||
|
||
|
||
def parse_date(date_str: str, default_date_str: str='1700-01-01') -> dt.date: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you decide to pass the default as a string when every current usage of this method has a real date to hand? Why the default 1700-01-01
?
I would remove the default all together and return None
when no date could be parsed. The calling code can decide whether it uses a default or does some other stuff then.
Description
This pull request adds a new
publication_dt
field to thePublication
model in thewebmonitor
app. Thepublication_dt
field is aDateField
that represents the date the publication was published.In order to populate the
publication_dt
field, thedatefinder
package was added as a requirement, and a new function was implemented to parse the publication date from the existingpublication_date
field.Additionally, a new migration was prepared to add the
publication_dt
field to the database schema. A newclean_publication_dt
method was added to theScrapedPublicationForm
to validate thepublication_dt
field. Finally, thepublication_dt
field was registered with thePublicationSerializer
to ensure that it is included in the REST API.Changes Made
datefinder
to the list of requirements inrequirements.txt
.parse_date
inutils/date.py
to parse the publication date from thepublication_date
field.publication_dt
field to thePublication
model inmodels/publication.py
.publication_dt
field to the database schema.clean_publication_dt
method to theScrapedPublicationForm
informs/publication.py
to validate thepublication_dt
field.publication_dt
field with thePublicationSerializer
inapi/serializers/publication.py
to ensure that it is included in the REST API. Also registered atPublicationAdmin
inadmin/publication.py
andPublicationDocument
indocuments/publication.py
Testing
I tested these changes on my local development environment and verified that the
publication_dt
field is added to thePublication
model, theclean_publication_dt
method validates thepublication_dt
field correctly, and thepublication_dt
field is included in the REST API.I also verified that the new migration can be applied successfully using the
python manage.py migrate
command.