Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publication Date Parsing #14

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
asgiref>=3.3.0
celery==5.0.5
cssselect==1.1.0
datefinder==0.7.3
Django==3.2.16
django-elasticsearch-dsl==7.2.2
django-filter==2.4.0
Expand Down
1 change: 1 addition & 0 deletions swp/admin/publication.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ class PublicationAdmin(admin.ModelAdmin):
'authors',
'abstract',
'publication_date',
'publication_dt',
'last_access',
'url',
'pdf_url',
Expand Down
1 change: 1 addition & 0 deletions swp/api/serializers/publication.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ class Meta:
'authors',
'abstract',
'publication_date',
'publication_dt',
'last_access',
'url',
'pdf_url',
Expand Down
1 change: 1 addition & 0 deletions swp/documents/publication.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,7 @@ class Django:
'ris_type',
'authors',
'publication_date',
'publication_dt',
'last_access',
'url',
'pdf_url',
Expand Down
7 changes: 7 additions & 0 deletions swp/forms/publication.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@
from django.utils.translation import gettext_lazy as _

from swp.models import Publication
from swp.utils.date import parse_date

if TYPE_CHECKING:
from swp.models import Thinktank

Expand Down Expand Up @@ -64,6 +66,11 @@ def clean_authors(self) -> Sequence[str]:

return [clean(author) for author in items]

def clean_publication_dt(self) -> datetime.date:
value = self.truncated_field('publication_date')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary to use a truncated value here. You can directly use self.cleaned_data.

default = f'{datetime.date.today().year}-01-01'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When working with dates and datetimes in django consider using the timezone aware utils from django.utils.timezone.

return parse_date(value, default_date_str=default)

def clean(self) -> Mapping[str, Any]:
super().clean()

Expand Down
30 changes: 30 additions & 0 deletions swp/migrations/0040_publication_publication_dt.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Generated by Django 3.2.16 on 2023-04-13 14:56

from django.db import migrations, models

from swp.utils.date import parse_date


def migrate_date(apps, schema_editor):
Publication = apps.get_model('swp', 'Publication')

for publication in Publication.objects.all():
default = publication.created.date().isoformat()
publication.publication_dt = parse_date(publication.publication_date, default_date_str=default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my initial discussion about losing the information whether a publication date string could be (fully parsed): your field is nullable, maybe keep it null, when no date could be parsed instead of filling it with a misleading date derived from the created date.

publication.save(update_fields=['publication_dt'])


class Migration(migrations.Migration):

dependencies = [
('swp', '0039_publication_list'),
]

operations = [
migrations.AddField(
model_name='publication',
name='publication_dt',
field=models.DateField(blank=True, null=True, verbose_name='publication dt'),
),
migrations.RunPython(code=migrate_date),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a reverse migration. In this case a migrations.RunPython.noop is sufficient. But with out it, this migration cannot be reverted.

]
1 change: 1 addition & 0 deletions swp/models/publication.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ class Publication(models.Model):
abstract = models.TextField(_('abstract'), blank=True) # [AB]
authors = ArrayField(models.CharField(max_length=255), blank=True, null=True, verbose_name=_('authors')) # [AU]
publication_date = models.CharField(_('publication date'), max_length=255, blank=True, default='') # [PY]
publication_dt = models.DateField(_('publication dt'), null=True, blank=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming scheme is a bit confusing. It would be better to make clear which field contains which data just by the name. I would recommend:

  • renaming the old publication_date CharField to publication_date_raw
  • publication_date for your real DateField

last_access = models.DateTimeField(_('last access'), default=timezone.now, editable=False) # [Y2]
url = LongURLField(_('URL')) # [UR]
pdf_url = LongURLField(_('PDF URL'), blank=True) # [L1]
Expand Down
28 changes: 28 additions & 0 deletions swp/utils/date.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
import re
import datetime as dt

import datefinder


def parse_date(date_str: str, default_date_str: str='1700-01-01') -> dt.date:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to pass the default as a string when every current usage of this method has a real date to hand? Why the default 1700-01-01?

I would remove the default all together and return None when no date could be parsed. The calling code can decide whether it uses a default or does some other stuff then.

date_str = ' '.join(date_str.split())
default_date = dt.datetime.fromisoformat(default_date_str).date()
if not date_str:
return default_date
try:
return dt.datetime.fromisoformat(date_str).date()
except ValueError:
pass
year = default_date.year
years_found = re.findall(r'(19\d{2}|20\d{2})', date_str)
if years_found:
year = int(sorted(years_found)[0])
if year != default_date.year:
default_date = dt.date(year, 1, 1)
dates_found = sorted([d.date() for d in datefinder.find_dates(date_str)])
date = dates_found[0] if dates_found else None
if date is None:
return default_date
if years_found and date.year != default_date.year:
return default_date
return date