Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Mozilla's Readibility.js #42

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sebastian-nagel
Copy link
Contributor

  • see https://github.com/mozilla/readability
  • enabled by command-line flag --readerView
  • strip the boilerplate from text and HTML (done of a clone of the DOM)
  • extract article metadata (author, etc. - if available)
  • add readable 'article' object to records in pages.jsonl

The readability library is useful to get a clean text for news articles or blog posts, see the example below. Of course, the results are not always perfect.

{
  "id": "...",
  "url": "https://www.theguardian.com/government-computing-network/2011/jul/25/national-archives-web-archiving-project",
  "title": "National Archives pilots council web archiving project | Guardian Government Computing | The Guardian",
  "text": "Advertisement\nNews\nOpinion\nSport\nCulture\nLifestyle\nShow\nMore\nShow More\nNews\nCoronavirus\nWorld news\nUK news\nEnvironment\nScience\nGlobal development\nFootball\nTech\nBusiness\nObituaries\nOpinion\nThe Guardian view\nColumnists\nCartoons\nOpinion videos\nLetters\nSport\nFootball\nCricket\nRugby union\nTennis\nCycling\nF1\nGolf\nUS sports\nCulture\nBooks\nMusic\nTV & radio\nArt & design\nFilm\nGames\nClassical\nStage\nLifestyle\nFashion\nFood\nRecipes\nLove & sex\nHealth & fitness\nHome & garden\nWomen\nMen\nFamily\nTravel\nMoney\nMake a contribution\nSubscribe\nSearch jobs\nHolidays\nDigital Archive\nGuardian Puzzles app\nThe Guardian app\nVideo\nPodcasts\nPictures\nNewsletters\nToday's paper\nInside the Guardian\nThe Observer\nGuardian Weekly\nCrosswords\nSearch jobs\nHolidays\nDigital Archive\nGuardian Puzzles app\nGuardian Government Computing\nThis article is more than\n9 years old\nNational Archives pilots council web archiving project\nThis article is more than 9 years old\nNew project will allow councils to preserve online information\nG\nu\na\nr\nd\ni\na\nn\nG\no\nv\ne\nr\nn\nm\ne\nn\nt\nC\no\nm\np\nu\nt\ni\nn\ng\nMon 25 Jul 2011 12.04 BST\n1\n1\nA web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.\nIt will run the pilots ...",
  "article": {
    "title": "National Archives pilots council web archiving project",
    "byline": null,
    "dir": null,
    "content": "<div id=\"readability-page-1\" class=\"page\"><div><p>A web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.</p><p>It will run the pilots ...",
    "textContent": "A web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.It will run the pilots ...",
    "length": 3414,
    "excerpt": "New project will allow councils to preserve online information",
    "siteName": "the Guardian"
  }
}

crawler.js Outdated
@@ -654,14 +683,20 @@ class Crawler {
if (!fs.existsSync(this.pagesDir)) {
fs.mkdirSync(this.pagesDir);
const header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"};
header["hasText"] = this.params.text;
header["hasReaderView"] = this.params.readerView;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding "hasReaderView" how about using "textSource". That way we can reference using readability or the browser dom or 'boilerpipe'. the boilerpipe library lives in py-wacz and we have the dom extraction method used in browsertrix-crawler and in archiveweb.page. This way we can be specific about what method was used for text extraction.

So the current default would be browser-dom and when reader view is set it would instead be 'readability'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can change this. Just to confirm, you mean?

        header["textSource"] = (this.params.readerView ? "readability" : "browser-dom");

Also: by now, the article object is added to a page in case there is the reader view is available, see isProbablyReaderable. If the intention is to replace title and text by the readerable values, shouldn't the property set per page?

@emmadickson
Copy link
Contributor

Would you be willing to add tests and a line in the readme where we detail all available flags?

@sebastian-nagel
Copy link
Contributor Author

Sure, I'll add a test. Is it possible to provide a static page source? Could take https://www.iana.org/domains/reserved (Firefox applies the reader view), but with a remote page source any subtle change in the page content or layout may break the test.

@emmadickson
Copy link
Contributor

@sebastian-nagel sorry for the radio silence. We've been using www.example.com or example.org for tests.

- see https://github.com/mozilla/readability
- if enabled (command-line flag --readerView):
- remove boilerplate from text and HTML
- (if available) extract article metadat (author, etc.)
- add readable 'article' object to page records in pages.jsonl
- indicate in header "textSource" from where the text extract could be
  taken (via readability or via DOM dump)
@sebastian-nagel
Copy link
Contributor Author

@emmadickson: no problem. sorry as well, some time has passed. I've rebased the branch to the current main branch to start work.

We've been using www.example.com or example.org for tests.

Sure. The point is: the page is so simple that there is no "readable" version of it. Only a short and clean text, no boilerplate to strip. So I've added a unit test reading https://www.iana.org/about - no cons if the tests shall be limited to example.com.

Copy link
Contributor

@stavares843 stavares843 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good practice to validate user input and handle any potential errors. In this case, there is no validation of the --readerView flag, and if the user enters an invalid value, the code will still run without showing any warnings or errors. It may be worth adding a check for invalid flag values and show an error message to the user.
The Readability library is loaded from a local file path, which could cause issues if the file is missing or corrupted. It may be more reliable to include Readability as a package dependency and load it through require() instead of reading from a file.
There are some console log statements in the code, which are useful for debugging during development, but they should be removed before the code is deployed to production.
There is a typo in the commit message: "metadata" is misspelled as "metadat".
Other than these minor issues, the code changes look good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants