Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce paragraph count summary #2761

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

gagath
Copy link
Contributor

@gagath gagath commented May 28, 2020

Let the users use only the n-first paragraphs of the article as article
summary.

The advantage of this approach is that we avoid the random word count
ellipsis that will cut content in pieces, while not having to copy the
first paragraph of the article into the article's summary metadata.

Pull Request Checklist

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code
  • Updated documentation for changed code

@boxydog
Copy link
Contributor

boxydog commented Oct 29, 2023

The functionality (paragraphs) seems reasonable. Using negative and positive SUMMARY_MAX_LENGTH feels confusing. I'd introduce a different config var. But, my opinion is not binding.

@boxydog
Copy link
Contributor

boxydog commented Oct 29, 2023

Less confusing would be SUMMARY_MAX_LENGTH (for words) and SUMMARY_MAX_PARAGRAPHS (for paragraphs).

If both are set, could 1) apply both, or 2) pick a winner (e.g., SUMMARY_MAX_PARAGRAPHS), or 3) raise an error, or 4) hidden option 4.

@paulocoutinhox
Copy link

+1

pelican/utils.py Outdated
for i in range(count):
substr = substr[tag_stop:]
tag_start = substr.find('<p>')
tag_stop = substr.find('</p>') + len('</p>')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My time on the interwebs taught me that the possibility of finding a closing tag for <p> is like a coin toss. Even the HTML spec is permissive about it and "formally" allows omitting </p> in most cases. Long story short, I am trying to say that this is quite unreliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTML is mostly generated from a generator (reST, Markdown…) that hopefully emits closing tags, so in practice this should not happen very much.

Copy link
Member

@avaris avaris Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in practice this should not happen very much.

That doesn't cut it though :). rst and md generators may play nice (for now) but content can come from any number for sources not limited to those, including direct html files, plugins for other formats etc. In software, edge cases happen more common than one would think initially :).

I'm not suggesting employing full DOM parsers like lxml or similar, although that would be more complete. But that can be postponed until the need arises. It should at least produce something reasonable in those cases: i.e. if </p> is not present (substr.find('</p>') == -1), take up to the next <p>:

tag_stop = substr.find('</p>')
if tag_stop == -1:  # no closing p
    tag_stop = substr.find('<p>', tag_start + len('<p>'))
    #  if tag_stop == -1:  # if there are no more <p>s take everything else?
    #      tag_stop = None
else:
    tag_stop += len('</p>')

@gagath gagath changed the title [WIP] Introduce paragraph count summary Introduce paragraph count summary Nov 17, 2023
@gagath
Copy link
Contributor Author

gagath commented Nov 17, 2023

  • Rebased
  • Added new setting SUMMARY_MAX_PARAGRAPHS
  • Added a new test to verify what happens when both SUMMARY_MAX_PARAGRAPHS and SUMMARY_MAX_LENGTH are set
  • Fixed formatting issues
  • Added missing documentation

docs/settings.rst Outdated Show resolved Hide resolved
pelican/contents.py Outdated Show resolved Hide resolved
Let the users use only the n-first paragraphs of the article as article
summary.

The advantage of this approach is that we avoid the random word count
ellipsis that will cut content in pieces, while not having to copy the
first paragraph of the article into the article's summary metadata.

If both SUMMARY_MAX_PARAGRAPHS and SUMMARY_MAX_LENGTH are set, then
the SUMMARY_MAX_LENGTH option will apply to the number of paragraphs in
SUMMARY_MAX_PARAGRAPHS.
@gagath
Copy link
Contributor Author

gagath commented Nov 18, 2023

@avaris Thanks for your comments, fixed them.

Copy link
Member

@justinmayer justinmayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks to @gagath for the enhancement as well as for your patience.

Thanks also to @boxydog and @avaris for reviewing!

@justinmayer justinmayer merged commit 513abbf into getpelican:main Jun 25, 2024
16 checks passed
@gagath gagath deleted the paragraph-summary branch August 4, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants