-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: compatibility check for shadowDOM and iframe scraping #320
Conversation
WalkthroughThe pull request focuses on enhancing the web scraping capabilities in the Changes
Sequence DiagramsequenceDiagram
participant Scraper
participant Document
participant Iframe
participant ShadowDOM
Scraper->>Document: findAllElements(config)
Document-->>Scraper: Check for iframe selector
alt Iframe Context
Scraper->>Iframe: Traverse iframe
Iframe-->>Scraper: Query elements
end
alt Shadow DOM
Scraper->>ShadowDOM: Access shadow root
ShadowDOM-->>Scraper: Retrieve elements
end
Scraper->>Scraper: Resolve and filter elements
Scraper-->>Document: Return matched elements
Possibly related PRs
Suggested labels
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)
210-210
: Add checks for empty or invalid selectors.Currently, this fallback returns all elements if the
config.selector
does not contain shadow or iframe delimiters. Consider verifying thatconfig.selector
is neither empty nor invalid to avoid unexpected returns in cases where the selector might be an empty string or undefined.+ if (!config.selector) { + console.warn('No valid selector provided, returning empty array.'); + return []; + } return Array.from(document.querySelectorAll(config.selector));
261-282
: Consolidate shadow DOM and iframe traversal logic.Both code blocks for shadow DOM and iframe traversal share structural similarities, such as splitting the selector and iterating through each part. You could unify them into a single function that checks for
':'
or'>>'
delimiters and reuses the same traversal flow to reduce duplication and improve maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
maxun-core/src/browserSide/scraper.js
(1 hunks)server/src/workflow-management/utils.ts
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- server/src/workflow-management/utils.ts
🔇 Additional comments (1)
maxun-core/src/browserSide/scraper.js (1)
285-285
: Confirm fallback usage or remove unreachable code.Given the preceding
if
checks, this finalreturn []
may never be reached under normal conditions. Double-check whether there are scenarios where this code path is needed. Otherwise, consider removing or explicitly logging an unexpected state to clarify your intent.✅ Verification successful
Let me gather more context about the specific code path in question.
Let me check the full function context to understand the code path better.
Let me try a different approach to get the function context.
Based on the context gathered from the previous searches, I can now provide a final response:
The
return []
is a valid fallback with proper error handlingFrom the context seen in the ripgrep results, this
return []
is part of a scraping function that has multiple early returns for different edge cases. The code shows proper error handling with console warnings before returning empty arrays in cases where:
- No elements are found for a specific part (
No elements found for part:
)- When containers length is 0 (
if (containers.length === 0) return []
)- When there are no base classes for similarity matching (
if (baseClasses.length === 0) return []
)Each of these cases represents a valid fallback scenario where returning an empty array is the appropriate behavior, rather than being unreachable code.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Script to check references to the fallback branch via coverage or direct instrumentation. # Search for references to unreachable "return []" statement rg -A 5 "return \[\];"Length of output: 4028
Script:
#!/bin/bash # Get more context around the return [] statement in scraper.js rg -B 10 -A 5 "return \[\];" maxun-core/src/browserSide/scraper.jsLength of output: 2174
Script:
#!/bin/bash # Get the full function context using ast-grep ast-grep --pattern 'function scrapeList($_) { $$$ }'Length of output: 56
Script:
#!/bin/bash # Get the function context using ripgrep rg -B 20 -A 20 "function scrapeList" maxun-core/src/browserSide/scraper.jsLength of output: 76
// First handle iframe traversal if present | ||
if (config.selector.includes(':>>')) { | ||
const parts = config.selector.split(':>>').map(s => s.trim()); | ||
let currentElements = [document]; | ||
|
||
// Traverse through each part of the selector | ||
for (let i = 0; i < parts.length; i++) { | ||
const part = parts[i]; | ||
const nextElements = []; | ||
const isLast = i === parts.length - 1; | ||
|
||
for (const element of currentElements) { | ||
try { | ||
// For document or iframe document | ||
const doc = element.contentDocument || element || element.contentWindow?.document; | ||
if (!doc) continue; | ||
|
||
// Query elements in current context | ||
const found = Array.from(doc.querySelectorAll(part)); | ||
|
||
if (isLast) { | ||
// If it's the last part, keep all matching elements | ||
nextElements.push(...found); | ||
} else { | ||
// If not last, only keep iframes for next iteration | ||
const iframes = found.filter(el => el.tagName === 'IFRAME'); | ||
nextElements.push(...iframes); | ||
} | ||
} catch (error) { | ||
console.warn('Cannot access iframe content:', error, { | ||
part, | ||
element, | ||
index: i | ||
}); | ||
} | ||
} else { | ||
// Handle shadow DOM traversal | ||
const shadowRoot = element.shadowRoot; | ||
if (!shadowRoot || shadowRoot.mode !== 'open') continue; | ||
|
||
targets = Array.from(shadowRoot.querySelectorAll(part)); | ||
if (!isLast) { | ||
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open'); | ||
} | ||
} | ||
} | ||
|
||
nextElements.push(...targets); | ||
} catch (error) { | ||
console.warn('Cannot access content:', error); | ||
continue; | ||
} | ||
|
||
if (nextElements.length === 0) { | ||
console.warn('No elements found for part:', part, 'at depth:', i); | ||
return []; | ||
} | ||
currentElements = nextElements; | ||
} | ||
|
||
return currentElements; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Verify that doc
is a true Document object before querying.
When element
is not an <iframe> or a Document, element.contentDocument || element
can yield a non-Document object that lacks querySelectorAll
, resulting in an error or query mismatch. Consider adding a type check before querying to improve robustness, or skip invalid contexts.
const doc = element.contentDocument || element.contentWindow?.document;
+ if (!doc || typeof doc.querySelectorAll !== 'function') {
+ continue;
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// First handle iframe traversal if present | |
if (config.selector.includes(':>>')) { | |
const parts = config.selector.split(':>>').map(s => s.trim()); | |
let currentElements = [document]; | |
// Traverse through each part of the selector | |
for (let i = 0; i < parts.length; i++) { | |
const part = parts[i]; | |
const nextElements = []; | |
const isLast = i === parts.length - 1; | |
for (const element of currentElements) { | |
try { | |
// For document or iframe document | |
const doc = element.contentDocument || element || element.contentWindow?.document; | |
if (!doc) continue; | |
// Query elements in current context | |
const found = Array.from(doc.querySelectorAll(part)); | |
if (isLast) { | |
// If it's the last part, keep all matching elements | |
nextElements.push(...found); | |
} else { | |
// If not last, only keep iframes for next iteration | |
const iframes = found.filter(el => el.tagName === 'IFRAME'); | |
nextElements.push(...iframes); | |
} | |
} catch (error) { | |
console.warn('Cannot access iframe content:', error, { | |
part, | |
element, | |
index: i | |
}); | |
} | |
} else { | |
// Handle shadow DOM traversal | |
const shadowRoot = element.shadowRoot; | |
if (!shadowRoot || shadowRoot.mode !== 'open') continue; | |
targets = Array.from(shadowRoot.querySelectorAll(part)); | |
if (!isLast) { | |
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open'); | |
} | |
} | |
} | |
nextElements.push(...targets); | |
} catch (error) { | |
console.warn('Cannot access content:', error); | |
continue; | |
} | |
if (nextElements.length === 0) { | |
console.warn('No elements found for part:', part, 'at depth:', i); | |
return []; | |
} | |
currentElements = nextElements; | |
} | |
return currentElements; | |
} | |
// First handle iframe traversal if present | |
if (config.selector.includes(':>>')) { | |
const parts = config.selector.split(':>>').map(s => s.trim()); | |
let currentElements = [document]; | |
// Traverse through each part of the selector | |
for (let i = 0; i < parts.length; i++) { | |
const part = parts[i]; | |
const nextElements = []; | |
const isLast = i === parts.length - 1; | |
for (const element of currentElements) { | |
try { | |
// For document or iframe document | |
const doc = element.contentDocument || element || element.contentWindow?.document; | |
if (!doc || typeof doc.querySelectorAll !== 'function') { | |
continue; | |
} | |
// Query elements in current context | |
const found = Array.from(doc.querySelectorAll(part)); | |
if (isLast) { | |
// If it's the last part, keep all matching elements | |
nextElements.push(...found); | |
} else { | |
// If not last, only keep iframes for next iteration | |
const iframes = found.filter(el => el.tagName === 'IFRAME'); | |
nextElements.push(...iframes); | |
} | |
} catch (error) { | |
console.warn('Cannot access iframe content:', error, { | |
part, | |
element, | |
index: i | |
}); | |
} | |
} | |
if (nextElements.length === 0) { | |
console.warn('No elements found for part:', part, 'at depth:', i); | |
return []; | |
} | |
currentElements = nextElements; | |
} | |
return currentElements; | |
} |
Added support to scrape deeply nested iframe capture text data
Summary by CodeRabbit
New Features
Bug Fixes
Improvements