feat: compatibility check for shadowDOM and iframe scraping #320

Merged: 2 commits merged into develop on Jan 4, 2025

Conversation

@RohitR311 (Collaborator) commented on Jan 4, 2025

Added support for scraping text data from deeply nested iframes.

Summary by CodeRabbit

  • New Features

    • Enhanced web scraping capabilities with improved handling of complex web structures
    • Added support for traversing iframes and shadow DOM elements
    • Improved context-aware URL resolution for element attributes
  • Bug Fixes

    • Refined element querying logic to handle more complex web page structures
    • Improved error handling for inaccessible iframe content and missing elements
  • Improvements

    • Streamlined DOM tree traversal
    • Enhanced robustness of scraping functions across different web contexts


coderabbitai bot commented Jan 4, 2025

Walkthrough

The pull request focuses on enhancing the web scraping capabilities in the scraper.js file, specifically improving the handling of complex web page structures involving iframes and shadow DOMs. The modifications primarily target the findAllElements and getElementValue functions, introducing more robust logic for traversing and querying elements across different document contexts. The changes aim to provide more comprehensive and flexible element selection mechanisms, with improved error handling and context-aware URL resolution.
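
For illustration only, a minimal sketch of what context-aware URL resolution can look like (the helper name resolveUrlInContext and its exact behavior are assumptions, not the PR's actual getElementValue code):

// Hypothetical helper illustrating context-aware URL resolution:
// resolve a relative attribute value against the element's own document,
// so links inside iframes are not resolved against the top-level page URL.
function resolveUrlInContext(element, attribute) {
  const raw = element.getAttribute(attribute);
  if (!raw) return null;

  // ownerDocument is the document the element belongs to (iframe or top-level);
  // baseURI accounts for any <base> tag in that document.
  const base = element.ownerDocument?.baseURI || document.baseURI;

  try {
    return new URL(raw, base).href; // absolute URL in the element's own context
  } catch (e) {
    console.warn('Could not resolve URL:', raw, e);
    return raw; // fall back to the raw attribute value
  }
}

// Example: resolveUrlInContext(linkInsideIframe, 'href') resolves against the
// iframe document's base URI rather than the parent page's URL.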

Changes

maxun-core/src/browserSide/scraper.js
  • Enhanced findAllElements function for iframe and shadow DOM traversal
  • Improved element querying logic with better error handling
  • Updated getElementValue for context-aware URL resolution
  • Modified scrapeList and scrapeListAuto to support new querying mechanisms

server/src/workflow-management/utils.ts
  • Added a blank line in getBestSelectorForAction function for improved readability

Sequence Diagram

sequenceDiagram
    participant Scraper
    participant Document
    participant Iframe
    participant ShadowDOM

    Scraper->>Document: findAllElements(config)
    Document-->>Scraper: Check for iframe selector
    alt Iframe Context
        Scraper->>Iframe: Traverse iframe
        Iframe-->>Scraper: Query elements
    end
    alt Shadow DOM
        Scraper->>ShadowDOM: Access shadow root
        ShadowDOM-->>Scraper: Retrieve elements
    end
    Scraper->>Scraper: Resolve and filter elements
    Scraper-->>Document: Return matched elements

Suggested labels

Type: Feature, Status: Work In Progress

Poem

🐰 Hopping through the DOM's wild maze,
Iframes and shadows no longer faze!
With selectors sharp and logic bright,
Our scraper dances with pure delight 🌐
Traversing contexts with rabbit might! 🚀


@RohitR311 merged commit 5058a3b into develop on Jan 4, 2025
1 check was pending

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)

210-210: Add checks for empty or invalid selectors.

Currently, this fallback returns all elements if the config.selector does not contain shadow or iframe delimiters. Consider verifying that config.selector is neither empty nor invalid to avoid unexpected returns in cases where the selector might be an empty string or undefined.

+ if (!config.selector) {
+   console.warn('No valid selector provided, returning empty array.');
+   return [];
+ }
return Array.from(document.querySelectorAll(config.selector));

261-282: Consolidate shadow DOM and iframe traversal logic.

Both code blocks for shadow DOM and iframe traversal share structural similarities, such as splitting the selector and iterating through each part. You could unify them into a single function that checks for ':' or '>>' delimiters and reuses the same traversal flow to reduce duplication and improve maintainability.
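
For illustration, a minimal sketch of such a consolidation (assuming ':>>' marks iframe hops and '>>' marks shadow DOM hops, and that a selector uses only one kind of hop; this is not the PR's actual code):

// Sketch of a unified traversal: one generic loop over selector parts,
// with a small branch that yields the next query context (iframe document
// or open shadow root). The delimiters are assumptions based on this PR.
function findAllElementsUnified(selector) {
  const isIframe = selector.includes(':>>');
  const delimiter = isIframe ? ':>>' : '>>';
  const parts = selector.split(delimiter).map(s => s.trim());

  let currentRoots = [document];

  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    const isLast = i === parts.length - 1;
    const next = [];

    for (const root of currentRoots) {
      let found = [];
      try {
        found = Array.from(root.querySelectorAll(part));
      } catch (error) {
        console.warn('Cannot query context:', error);
        continue;
      }

      if (isLast) {
        next.push(...found);
      } else if (isIframe) {
        // Descend into accessible iframe documents for the next part.
        for (const el of found) {
          const doc = el.tagName === 'IFRAME' ? el.contentDocument : null;
          if (doc) next.push(doc);
        }
      } else {
        // Descend into open shadow roots for the next part.
        for (const el of found) {
          if (el.shadowRoot && el.shadowRoot.mode === 'open') next.push(el.shadowRoot);
        }
      }
    }

    if (next.length === 0) return [];
    currentRoots = next;
  }

  return currentRoots;
}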

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c132f27 and 9c8a980.

📒 Files selected for processing (2)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
  • server/src/workflow-management/utils.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • server/src/workflow-management/utils.ts
🔇 Additional comments (1)
maxun-core/src/browserSide/scraper.js (1)

285-285: Confirm fallback usage or remove unreachable code.

Given the preceding if checks, this final return [] may never be reached under normal conditions. Double-check whether there are scenarios where this code path is needed. Otherwise, consider removing or explicitly logging an unexpected state to clarify your intent.

✅ Verification successful

Let me gather more context about the specific code path in question.


Let me check the full function context to understand the code path better.


Let me try a different approach to get the function context.


Based on the context gathered from the previous searches, I can now provide a final response:

The return [] is a valid fallback with proper error handling

From the context seen in the ripgrep results, this return [] is part of a scraping function that has multiple early returns for different edge cases. The code shows proper error handling with console warnings before returning empty arrays in cases where:

  • No elements are found for a specific part (No elements found for part:)
  • When containers length is 0 (if (containers.length === 0) return [])
  • When there are no base classes for similarity matching (if (baseClasses.length === 0) return [])

Each of these cases represents a valid fallback scenario where returning an empty array is the appropriate behavior, rather than being unreachable code.
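
If the explicit-logging option mentioned above were preferred instead, a minimal sketch (assuming the surrounding findAllElements structure, with the branch bodies elided) could look like:

function findAllElements(config) {
  // ... iframe and shadow DOM branches above ...

  // Explicit final fallback: log the unexpected state instead of silently
  // returning an empty array, so misconfigured selectors are easier to spot.
  console.warn('findAllElements: no traversal branch matched selector:', config.selector);
  return [];
}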

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Script to check references to the fallback branch via coverage or direct instrumentation.

# Search for references to unreachable "return []" statement
rg -A 5 "return \[\];"

Length of output: 4028


Script:

#!/bin/bash
# Get more context around the return [] statement in scraper.js
rg -B 10 -A 5 "return \[\];" maxun-core/src/browserSide/scraper.js

Length of output: 2174


Script:

#!/bin/bash
# Get the full function context using ast-grep
ast-grep --pattern 'function scrapeList($_) {
  $$$
}'

Length of output: 56


Script:

#!/bin/bash
# Get the function context using ripgrep
rg -B 20 -A 20 "function scrapeList" maxun-core/src/browserSide/scraper.js

Length of output: 76

Comment on lines +213 to +259
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
const parts = config.selector.split(':>>').map(s => s.trim());
let currentElements = [document];

// Traverse through each part of the selector
for (let i = 0; i < parts.length; i++) {
const part = parts[i];
const nextElements = [];
const isLast = i === parts.length - 1;

for (const element of currentElements) {
try {
// For document or iframe document
const doc = element.contentDocument || element || element.contentWindow?.document;
if (!doc) continue;

// Query elements in current context
const found = Array.from(doc.querySelectorAll(part));

if (isLast) {
// If it's the last part, keep all matching elements
nextElements.push(...found);
} else {
// If not last, only keep iframes for next iteration
const iframes = found.filter(el => el.tagName === 'IFRAME');
nextElements.push(...iframes);
}
} catch (error) {
console.warn('Cannot access iframe content:', error, {
part,
element,
index: i
});
}
} else {
// Handle shadow DOM traversal
const shadowRoot = element.shadowRoot;
if (!shadowRoot || shadowRoot.mode !== 'open') continue;

targets = Array.from(shadowRoot.querySelectorAll(part));
if (!isLast) {
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open');
}
}
}

nextElements.push(...targets);
} catch (error) {
console.warn('Cannot access content:', error);
continue;
}

if (nextElements.length === 0) {
console.warn('No elements found for part:', part, 'at depth:', i);
return [];
}
currentElements = nextElements;
}

return currentElements;
}


🛠️ Refactor suggestion

Verify that doc is a true Document object before querying.

When element is not an <iframe> or a Document, element.contentDocument || element can yield a non-Document object that lacks querySelectorAll, resulting in an error or query mismatch. Consider adding a type check before querying to improve robustness, or skip invalid contexts.

const doc = element.contentDocument || element.contentWindow?.document;
+ if (!doc || typeof doc.querySelectorAll !== 'function') {
+   continue;
+ }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
const parts = config.selector.split(':>>').map(s => s.trim());
let currentElements = [document];
// Traverse through each part of the selector
for (let i = 0; i < parts.length; i++) {
const part = parts[i];
const nextElements = [];
const isLast = i === parts.length - 1;
for (const element of currentElements) {
try {
// For document or iframe document
const doc = element.contentDocument || element || element.contentWindow?.document;
if (!doc) continue;
// Query elements in current context
const found = Array.from(doc.querySelectorAll(part));
if (isLast) {
// If it's the last part, keep all matching elements
nextElements.push(...found);
} else {
// If not last, only keep iframes for next iteration
const iframes = found.filter(el => el.tagName === 'IFRAME');
nextElements.push(...iframes);
}
} catch (error) {
console.warn('Cannot access iframe content:', error, {
part,
element,
index: i
});
}
} else {
// Handle shadow DOM traversal
const shadowRoot = element.shadowRoot;
if (!shadowRoot || shadowRoot.mode !== 'open') continue;
targets = Array.from(shadowRoot.querySelectorAll(part));
if (!isLast) {
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open');
}
}
}
nextElements.push(...targets);
} catch (error) {
console.warn('Cannot access content:', error);
continue;
}
if (nextElements.length === 0) {
console.warn('No elements found for part:', part, 'at depth:', i);
return [];
}
currentElements = nextElements;
}
return currentElements;
}
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
  const parts = config.selector.split(':>>').map(s => s.trim());
  let currentElements = [document];

  // Traverse through each part of the selector
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    const nextElements = [];
    const isLast = i === parts.length - 1;

    for (const element of currentElements) {
      try {
        // For document or iframe document
        const doc = element.contentDocument || element || element.contentWindow?.document;
        if (!doc || typeof doc.querySelectorAll !== 'function') {
          continue;
        }

        // Query elements in current context
        const found = Array.from(doc.querySelectorAll(part));

        if (isLast) {
          // If it's the last part, keep all matching elements
          nextElements.push(...found);
        } else {
          // If not last, only keep iframes for next iteration
          const iframes = found.filter(el => el.tagName === 'IFRAME');
          nextElements.push(...iframes);
        }
      } catch (error) {
        console.warn('Cannot access iframe content:', error, {
          part,
          element,
          index: i
        });
      }
    }

    if (nextElements.length === 0) {
      console.warn('No elements found for part:', part, 'at depth:', i);
      return [];
    }
    currentElements = nextElements;
  }

  return currentElements;
}
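
An alternative guard, sketched under the same assumptions (this sits inside the for (const element of currentElements) loop and is not a committable suggestion), checks the node type explicitly instead of probing for querySelectorAll:

// Alternative guard: accept only real Document nodes (iframe documents or the
// top-level document) before querying, using nodeType instead of duck-typing.
const doc = element.contentDocument || element.contentWindow?.document ||
  (element.nodeType === Node.DOCUMENT_NODE ? element : null);
if (!doc) {
  continue; // skip contexts that are not a Document
}
const found = Array.from(doc.querySelectorAll(part));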

Labels: Type: Feature (New features)
2 participants