This repository has been archived by the owner on Nov 24, 2018. It is now read-only.

128 KB AWS IoT message broker limit with .evaluate() result #114

Open
vladholubiev opened this issue Jul 31, 2017 · 16 comments

Comments

@vladholubiev
Contributor

I'm using Chromeless to scrape HTML from websites, and discovered that the function never returns a value if the text is too big. I deployed my own serverless project as provided in the repo. By trial and error I found that it times out if the returned HTML string is larger than 131060 bytes.

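const { Chromeless } = require('chromeless')

async function run() {
  // remote: true routes commands through the deployed Lambda proxy
  const chromeless = new Chromeless({ remote: true })

  const text = await chromeless
    .goto('https://www.graph.cool')
    .evaluate(() => 'a'.repeat(131061)) // times out, but 131060 works

  console.log(text)

  await chromeless.end()
}

run().catch(console.error.bind(console))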

Looking up this 'magical' number, it seems to make some sense:

image

Is there any internal CDP limitation for 128KiB?

@adieuadieu
Collaborator

Hi @vladgolubev. Hm, I suspect you've run into the 128 KB AWS IoT message broker limit. I'm not sure about the best solution, but we'll need to figure something out, as I can imagine 128 KB won't be enough in many situations.
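
For reference, 128 KiB is 131072 bytes, so the observed cutoff of 131060 bytes sits 12 bytes below it, which would be consistent with a small wrapper around the result. A minimal sketch of a size check before publishing, assuming the result travels as a JSON string and that the limit is exactly 128 KiB per message; `MAX_MESSAGE_BYTES` and `fitsInOneMessage` are illustrative names, not part of Chromeless:

```javascript
// Assumption: the broker rejects any single message above 128 KiB.
const MAX_MESSAGE_BYTES = 128 * 1024 // 131072

// Hypothetical helper: true if the JSON-encoded value fits in one message.
function fitsInOneMessage(value, limit = MAX_MESSAGE_BYTES) {
  // JSON.stringify(undefined) returns undefined, so fall back to 'null'
  const json = JSON.stringify(value) || 'null'
  // Measure bytes, not characters: multi-byte UTF-8 text is larger than its length
  return Buffer.byteLength(json, 'utf8') <= limit
}
```

A check like this only detects the problem early; it does not say how much room the real message envelope takes, so the effective limit may be somewhat lower.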

@joelgriffith
Contributor

Is it possible to gzip content?
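
A rough sketch of what gzipping a result could look like with Node's built-in `zlib` module. `packResult` and `unpackResult` are hypothetical helpers, and the base64 step assumes the transport carries text rather than raw bytes:

```javascript
const zlib = require('zlib')

// Compress a string result and encode it so it can travel in a text message.
function packResult(text) {
  return zlib.gzipSync(Buffer.from(text, 'utf8')).toString('base64')
}

// Reverse of packResult on the receiving side.
function unpackResult(packed) {
  return zlib.gunzipSync(Buffer.from(packed, 'base64')).toString('utf8')
}
```

Compression raises the practical ceiling for repetitive content such as HTML, but it does not remove the limit: a large enough or poorly compressible result would still exceed it.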

@vladholubiev
Contributor Author

@adieuadieu maybe similar solution as for pdfs/screenshots?

Implement .html() method(#74) which will upload ${cuid()}.html file to S3 bucket?

@adieuadieu
Collaborator

I'm thinking something along the lines of breaking up the payload into multiple message chunks that get passed around by the MQTT broker, perhaps gzipping them on top of that. We would like to support Azure and GCP in the future, too, so we also need to take their equivalent messaging products and their limits into consideration.
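
One way the chunking idea could be sketched, independent of any particular broker. The message shape (`id`, `index`, `total`, `data`) and both function names are assumptions for illustration, not an existing Chromeless protocol:

```javascript
// Hypothetical chunking scheme: split a string payload into ordered messages.
function splitIntoChunks(id, payload, maxChunkBytes) {
  const bytes = Buffer.from(payload, 'utf8')
  const total = Math.max(1, Math.ceil(bytes.length / maxChunkBytes))
  const chunks = []
  for (let index = 0; index < total; index++) {
    const slice = bytes.subarray(index * maxChunkBytes, (index + 1) * maxChunkBytes)
    // base64 keeps the bytes intact even when a slice cuts through a multi-byte character
    chunks.push({ id, index, total, data: slice.toString('base64') })
  }
  return chunks
}

// Reassemble chunks that may have arrived out of order.
function joinChunks(chunks) {
  const ordered = [...chunks].sort((a, b) => a.index - b.index)
  if (ordered.length === 0 || ordered.length !== ordered[0].total) {
    throw new Error('missing chunks')
  }
  return Buffer.concat(ordered.map(c => Buffer.from(c.data, 'base64'))).toString('utf8')
}
```

Because base64 inflates data by about a third, `maxChunkBytes` would have to be chosen well below the broker limit (for example 90 KiB of raw bytes per chunk under a 128 KiB limit), and a different provider's limit would only change that one parameter.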

@adieuadieu changed the title from ".evaluate() times out when returning >=131061 bytes, <=131060 works" to "128 KB AWS IoT message broker limit with .evaluate() result" on Jul 31, 2017
@adieuadieu
Collaborator

adieuadieu commented Jul 31, 2017

@vladgolubev we don't have to worry about the response payload limit (or any APIG limits) since we never respond with anything Chrome-related from the Lambda function's callback(). Currently, everything is communicated between Chromeless and the Proxy (running on Lambda) over MQTT (AWS IoT).

@vladholubiev
Contributor Author

@adieuadieu Could the 6 MB response payload limit for Lambda or the 10 MB limit for API Gateway become an issue later, even after splitting? Or does Chromeless not interact with Lambda directly?

@adieuadieu self-assigned this on Jul 31, 2017
@vladholubiev
Contributor Author

Thanks, now I got it!

Wanted to leave this here as a reference for how AWS encapsulated a solution to a similar problem: https://aws.amazon.com/about-aws/whats-new/2015/10/now-send-payloads-up-to-2gb-with-amazon-sqs/

But now I see splitting messages is a more generic solution.

Because it may work for html now, but then the same problem will pop up when someone wants to return a large array of URLs or whatever from .evaluate()

@labithiotis

@vladgolubev Hi, I am having issues using .html() because of the size limit mentioned above.
You mentioned that saving .html() output to S3 was implemented (${cuid()}.html); however, I'm not seeing those files in the S3 bucket, though I do see the .png files.

@vladholubiev
Contributor Author

@labithiotis sorry if it was misleading. I only suggested that solution. This size issue is still being resolved by @adieuadieu

@labithiotis

@vladgolubev Great to know, but is there anything I could do now to work around this? Either increase the limits or save the HTML?

@joelgriffith
Contributor

I think saving the HTML file is the best solution for the time being. @adieuadieu and @schickling, what do you think? .html() can return a large payload depending on the page.

@schickling
Owner

Another option would be to implement message chunking for the websocket connection.

Alternatively, we should make it easier to work with S3 while at the same time decoupling it from APIs like .screenshot etc. WDYT?

@joelgriffith
Contributor

I think there's a longer-term task to make chunking happen, but it seems like it is still a ways off. I can also see the case where folks want to persist more than just HTML to S3 (e.g. dumps of local-store or other serializable values).

Maybe the solution is in doing both to a degree:

  • Support chunking for larger messages in WS.
  • Support or refine APIs for persisting to S3 (e.g. have more descriptive APIs such as saveScreenshot and saveHtml).

@labithiotis

I adjusted my code to filter and search the page DOM inside evaluate, which avoids passing back huge payloads.
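
A sketch of that kind of workaround, assuming only link URLs are needed from the page. `collectLinks` is a hypothetical helper that takes the document as a parameter so it can be exercised outside a browser:

```javascript
// Hypothetical extractor: returns unique href values instead of the full HTML.
function collectLinks(doc) {
  const hrefs = Array.from(doc.querySelectorAll('a[href]'), a => a.href)
  return Array.from(new Set(hrefs))
}

// With Chromeless the same logic would be inlined in the page function, since
// that function runs in the browser and cannot see helpers defined in Node:
// const links = await chromeless.evaluate(() =>
//   Array.from(new Set(Array.from(document.querySelectorAll('a[href]'), a => a.href))))
```

Selecting the data in the page keeps the returned value small, although a page with enough matches could still push the result past the limit.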

@YazzyYaz
Contributor

YazzyYaz commented Aug 28, 2017

@joelgriffith @labithiotis I have added an htmlUrl() endpoint on this fork: https://github.com/YazzyYaz/chromeless. It works locally on my computer, writing a file with the HTML to my desktop. However, when I try to test it on AWS Lambda, the endpoint isn't recognized after I deploy. I even configured the package.json to point to the locally modified chromeless, and it didn't help. Any ideas on what I'm doing wrong?

EDIT: I was doing something stupid, it works on AWS Lambda now :)

@YazzyYaz
Contributor

@adieuadieu @joelgriffith PR for this issue: #274
