- FAQ
- Demo
- UMLs
- Objective
- Documentation
- Usage Limitations
- Contributions
- Contributing Guidelines
- Code Of Conduct
This application not only allows you to automate API creation, analyze and compare various websites, and generate insightful reports, but also enables you to export the obtained information in Excel format and retrieve specific results using keywords. The main goal of this web scraper is to gather information from any website by utilizing its URL and the designated target CSS class. Its adaptable design empowers you to collect data from your preferred sites without being constrained by predefined limits.
The code makes a POST request to the `/scrappe` endpoint at https://scraper-5ask.onrender.com/api/v1. The request body should contain the following parameters:

- `keyWord` (string): The keyword to filter articles by (optional).
- `url` (string): The URL of the web page to scrape (mandatory).
- `objectClass` (string): The CSS class of the elements to scrape from the web page (mandatory).
The API endpoint responds with a JSON object containing the following properties:

- `state`: A string indicating the state of the scraping process.
- `objects found`: The number of objects found after filtering.
- `key-word`: The keyword used for filtering.
- `scanned webpage`: The URL of the webpage that was scraped.
- `found articles`: An array of articles that match the filtering criteria.
- If the response is too large, the API uses the `compression` middleware to reduce its size.
- Implementing a `findOrCreate` method for Mongoose is a powerful tool to ensure that scraping websites doesn't lead to duplicated results in the database (a minimal sketch follows below).
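Below is a minimal sketch of how such a findOrCreate behavior can be expressed with Mongoose's `findOneAndUpdate` and the `upsert` option. The `Keyword` model and its fields are illustrative assumptions for demonstration, not the project's actual schema.

```typescript
// Illustrative sketch only: a findOrCreate-style helper built on Mongoose's
// findOneAndUpdate with upsert. The "Keyword" model and its fields are
// assumptions for demonstration, not the project's actual schema.
import mongoose, { Schema } from "mongoose";

const keywordSchema = new Schema(
  {
    keyword: { type: String, required: true, unique: true },
    usedTimes: { type: Number, default: 0 },
  },
  { timestamps: true }
);

const Keyword = mongoose.model("Keyword", keywordSchema);

// Returns the existing document for `keyword`, or creates it if it is new,
// so repeated scrapes never insert duplicate keyword entries.
async function findOrCreateKeyword(keyword: string) {
  return Keyword.findOneAndUpdate(
    { keyword },
    { $inc: { usedTimes: 1 } },
    { new: true, upsert: true, setDefaultsOnInsert: true }
  );
}
```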
Request body example:

{
"url":"https://www.url.com.ar",
"objectClass":".css-class-selector",
"keyWord":"keyword"
}
Response example:

{
"state": "success",
"objects found": 2,
"key-word": {
"doc": {
"_id": "64d40fa677d90019c57302ed",
"keyword": "keyword",
"createdAt": "2023-08-09T22:13:58.108Z",
"updatedAt": "2023-08-10T17:08:08.459Z",
"__v": 0,
"usedTimes": 28
},
"created": false
},
"scanned webpage": {
"_id": "64d3e3459686e7f4087acfdb",
"cssClass": ".css-class-selector",
"url": "https://www.url.com.ar",
"__v": 0,
"createdAt": "2023-08-09T19:04:37.137Z",
"scrapedTimes": 69,
"updatedAt": "2023-08-10T17:08:08.328Z"
},
"found articles": [
{
"_id": "64d4fcf821aef9f1dd17bbb8",
"websiteTarget": "64d3e3459686e7f4087acfdb",
"keywords": [
"64d40fa677d90019c57302ed"
],
"title": "Some Title",
"link": "/some/link/related/to/the/article",
"createdAt": "2023-08-10T15:06:32.535Z",
"updatedAt": "2023-08-10T17:08:08.643Z",
"__v": 2
}
]
}
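For reference, a request like the one above can be issued from any HTTP client. The sketch below uses Node's built-in `fetch` (Node 18+) with the placeholder values from the request body example, so adjust them to your target site.

```typescript
// Sketch: calling the /scrappe endpoint with Node's built-in fetch (Node 18+).
// The url, objectClass and keyWord values are the placeholders from the
// request body example above.
async function scrape() {
  const response = await fetch(
    "https://scraper-5ask.onrender.com/api/v1/scrappe",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        url: "https://www.url.com.ar",
        objectClass: ".css-class-selector",
        keyWord: "keyword", // optional
      }),
    }
  );

  const result = await response.json();
  console.log(result.state, result["objects found"]);
  return result;
}

scrape().catch(console.error);
```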
- Make a POST request to `/export/to-excel` (see the usage sketch after the example below).
- The request body should contain the following parameters:
  - `scanned webpage` (Object): Response from `/scrappe` (mandatory).
  - `found articles` (Objects Array): Response from `/scrappe` (mandatory).

Body example:
{
"scanned webpage": {
"_id": "64d3e3459686e7f4087acfdb",
"cssClass": ".css-class-selector",
"url": "https://www.url.com.ar",
"__v": 0,
"createdAt": "2023-08-09T19:04:37.137Z",
"scrapedTimes": 69,
"updatedAt": "2023-08-10T17:08:08.328Z"
},
"found articles":[
{
"_id": "64d4fcf821aef9f1dd17bbb8",
"websiteTarget": "64d3e3459686e7f4087acfdb",
"keywords": [
"64d40fa677d90019c57302ed"
],
"title": "Some Title",
"link": "/some/link/related/to/the/article",
"createdAt": "2023-08-10T15:06:32.535Z",
"updatedAt": "2023-08-10T17:08:08.643Z",
"__v": 2
}
]
}
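As a rough sketch, the relevant parts of a `/scrappe` response can be forwarded to `/export/to-excel` and the returned file saved to disk. This assumes the endpoint lives under the same `/api/v1` base URL and replies with the Excel workbook as a binary body; the output filename is an arbitrary choice.

```typescript
// Sketch: forwarding parts of a /scrappe response to /export/to-excel and
// writing the returned file to disk. Assumes the endpoint shares the /api/v1
// base URL and replies with the workbook as a binary body; "articles.xlsx"
// is an arbitrary filename.
import { writeFile } from "node:fs/promises";

async function exportToExcel(scrapeResult: any) {
  const res = await fetch(
    "https://scraper-5ask.onrender.com/api/v1/export/to-excel",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        "scanned webpage": scrapeResult["scanned webpage"],
        "found articles": scrapeResult["found articles"],
      }),
    }
  );

  const buffer = Buffer.from(await res.arrayBuffer());
  await writeFile("articles.xlsx", buffer);
}
```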
- You can only send up to 100 requests per 10 minutes (see the illustrative sketch after this list).
- If the webpage has incorrect element nesting, the scraper will fail.
- Before using this tool, please read the FAQ.
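For context, a 100-requests-per-10-minutes window is the kind of limit typically enforced with middleware such as `express-rate-limit`. The sketch below is only illustrative and may not match the API's actual setup.

```typescript
// Illustrative sketch only: a 100 requests / 10 minutes window enforced with
// express-rate-limit. The API's actual rate-limiting configuration may differ.
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();

app.use(
  rateLimit({
    windowMs: 10 * 60 * 1000, // 10-minute window
    max: 100, // up to 100 requests per IP per window
  })
);
```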
Special thanks to:
Frontend | Designers |
---|---|
@Robertw8 | @LorenaGambarrota |
@conorvenus | |
@sudeepmahato16 | |
@2div | |
@PraveenShinde3 | |
@Rayen-Allaya | |
@Piyush-Desmukh | |
@Bolaji06 | |
- Contributions are welcome! Please read our guidelines.