-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce paragraph count summary #2761
Conversation
d7935a4
to
298dd8c
Compare
The functionality (paragraphs) seems reasonable. Using negative and positive SUMMARY_MAX_LENGTH feels confusing. I'd introduce a different config var. But, my opinion is not binding. |
Less confusing would be SUMMARY_MAX_LENGTH (for words) and SUMMARY_MAX_PARAGRAPHS (for paragraphs). If both are set, could 1) apply both, or 2) pick a winner (e.g., SUMMARY_MAX_PARAGRAPHS), or 3) raise an error, or 4) hidden option 4. |
+1 |
pelican/utils.py
Outdated
for i in range(count): | ||
substr = substr[tag_stop:] | ||
tag_start = substr.find('<p>') | ||
tag_stop = substr.find('</p>') + len('</p>') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My time on the interwebs taught me that the possibility of finding a closing tag for <p>
is like a coin toss. Even the HTML spec is permissive about it and "formally" allows omitting </p>
in most cases. Long story short, I am trying to say that this is quite unreliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTML is mostly generated from a generator (reST, Markdown…) that hopefully emits closing tags, so in practice this should not happen very much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in practice this should not happen very much.
That doesn't cut it though :). rst
and md
generators may play nice (for now) but content can come from any number for sources not limited to those, including direct html
files, plugins for other formats etc. In software, edge cases happen more common than one would think initially :).
I'm not suggesting employing full DOM parsers like lxml
or similar, although that would be more complete. But that can be postponed until the need arises. It should at least produce something reasonable in those cases: i.e. if </p>
is not present (substr.find('</p>') == -1
), take up to the next <p>
:
tag_stop = substr.find('</p>')
if tag_stop == -1: # no closing p
tag_stop = substr.find('<p>', tag_start + len('<p>'))
# if tag_stop == -1: # if there are no more <p>s take everything else?
# tag_stop = None
else:
tag_stop += len('</p>')
298dd8c
to
1c044ed
Compare
|
1c044ed
to
d01c5b7
Compare
Let the users use only the n-first paragraphs of the article as article summary. The advantage of this approach is that we avoid the random word count ellipsis that will cut content in pieces, while not having to copy the first paragraph of the article into the article's summary metadata. If both SUMMARY_MAX_PARAGRAPHS and SUMMARY_MAX_LENGTH are set, then the SUMMARY_MAX_LENGTH option will apply to the number of paragraphs in SUMMARY_MAX_PARAGRAPHS.
d01c5b7
to
fd029b2
Compare
@avaris Thanks for your comments, fixed them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let the users use only the n-first paragraphs of the article as article
summary.
The advantage of this approach is that we avoid the random word count
ellipsis that will cut content in pieces, while not having to copy the
first paragraph of the article into the article's summary metadata.
Pull Request Checklist