feat: compatibility check for shadowDOM and iframe scraping #320

Merged: 2 commits merged into develop on Jan 4, 2025

Conversation

@RohitR311 (Collaborator) commented on Jan 4, 2025

Added support for scraping text data from deeply nested iframes.

Summary by CodeRabbit

  • New Features

    • Enhanced web scraping capabilities with improved handling of complex web structures
    • Added support for traversing iframes and shadow DOM elements
    • Improved context-aware URL resolution for element attributes
  • Bug Fixes

    • Refined element querying logic to handle more complex web page structures
    • Improved error handling for inaccessible iframe content and missing elements
  • Improvements

    • Streamlined DOM tree traversal
    • Enhanced robustness of scraping functions across different web contexts


coderabbitai bot commented Jan 4, 2025

Walkthrough

The pull request focuses on enhancing the web scraping capabilities in the scraper.js file, specifically improving the handling of complex web page structures involving iframes and shadow DOMs. The modifications primarily target the findAllElements and getElementValue functions, introducing more robust logic for traversing and querying elements across different document contexts. The changes aim to provide more comprehensive and flexible element selection mechanisms, with improved error handling and context-aware URL resolution.
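
For illustration only, a minimal sketch of what context-aware URL resolution can look like (the helper name resolveUrlInContext and its exact behavior are assumptions, not the PR's actual getElementValue code):

// Hypothetical helper illustrating context-aware URL resolution:
// resolve a relative attribute value against the element's own document,
// so links inside iframes are not resolved against the top-level page URL.
function resolveUrlInContext(element, attribute) {
  const raw = element.getAttribute(attribute);
  if (!raw) return null;

  // ownerDocument is the document the element belongs to (iframe or top-level);
  // baseURI accounts for any <base> tag in that document.
  const base = element.ownerDocument?.baseURI || document.baseURI;

  try {
    return new URL(raw, base).href; // absolute URL in the element's own context
  } catch (e) {
    console.warn('Could not resolve URL:', raw, e);
    return raw; // fall back to the raw attribute value
  }
}

// Example: resolveUrlInContext(linkInsideIframe, 'href') resolves against the
// iframe document's base URI rather than the parent page's URL.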

Changes

maxun-core/src/browserSide/scraper.js
  • Enhanced findAllElements function for iframe and shadow DOM traversal
  • Improved element querying logic with better error handling
  • Updated getElementValue for context-aware URL resolution
  • Modified scrapeList and scrapeListAuto to support new querying mechanisms

server/src/workflow-management/utils.ts
  • Added a blank line in getBestSelectorForAction function for improved readability

Sequence Diagram

sequenceDiagram
    participant Scraper
    participant Document
    participant Iframe
    participant ShadowDOM

    Scraper->>Document: findAllElements(config)
    Document-->>Scraper: Check for iframe selector
    alt Iframe Context
        Scraper->>Iframe: Traverse iframe
        Iframe-->>Scraper: Query elements
    end
    alt Shadow DOM
        Scraper->>ShadowDOM: Access shadow root
        ShadowDOM-->>Scraper: Retrieve elements
    end
    Scraper->>Scraper: Resolve and filter elements
    Scraper-->>Document: Return matched elements

Suggested labels

Type: Feature, Status: Work In Progress

Poem

🐰 Hopping through the DOM's wild maze,
Iframes and shadows no longer faze!
With selectors sharp and logic bright,
Our scraper dances with pure delight 🌐
Traversing contexts with rabbit might! 🚀


@RohitR311 merged commit 5058a3b into develop on Jan 4, 2025
1 check was pending

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)

210-210: Add checks for empty or invalid selectors.

Currently, this fallback returns all elements if the config.selector does not contain shadow or iframe delimiters. Consider verifying that config.selector is neither empty nor invalid to avoid unexpected returns in cases where the selector might be an empty string or undefined.

+ if (!config.selector) {
+   console.warn('No valid selector provided, returning empty array.');
+   return [];
+ }
return Array.from(document.querySelectorAll(config.selector));

261-282: Consolidate shadow DOM and iframe traversal logic.

Both code blocks for shadow DOM and iframe traversal share structural similarities, such as splitting the selector and iterating through each part. You could unify them into a single function that checks for ':' or '>>' delimiters and reuses the same traversal flow to reduce duplication and improve maintainability.
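
For illustration, a minimal sketch of such a consolidation (assuming ':>>' marks iframe hops and '>>' marks shadow DOM hops, and that a selector uses only one kind of hop; this is not the PR's actual code):

// Sketch of a unified traversal: one generic loop over selector parts,
// with a small branch that yields the next query context (iframe document
// or open shadow root). The delimiters are assumptions based on this PR.
function findAllElementsUnified(selector) {
  const isIframe = selector.includes(':>>');
  const delimiter = isIframe ? ':>>' : '>>';
  const parts = selector.split(delimiter).map(s => s.trim());

  let currentRoots = [document];

  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    const isLast = i === parts.length - 1;
    const next = [];

    for (const root of currentRoots) {
      let found = [];
      try {
        found = Array.from(root.querySelectorAll(part));
      } catch (error) {
        console.warn('Cannot query context:', error);
        continue;
      }

      if (isLast) {
        next.push(...found);
      } else if (isIframe) {
        // Descend into accessible iframe documents for the next part.
        for (const el of found) {
          const doc = el.tagName === 'IFRAME' ? el.contentDocument : null;
          if (doc) next.push(doc);
        }
      } else {
        // Descend into open shadow roots for the next part.
        for (const el of found) {
          if (el.shadowRoot && el.shadowRoot.mode === 'open') next.push(el.shadowRoot);
        }
      }
    }

    if (next.length === 0) return [];
    currentRoots = next;
  }

  return currentRoots;
}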

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c132f27 and 9c8a980.

📒 Files selected for processing (2)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
  • server/src/workflow-management/utils.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • server/src/workflow-management/utils.ts
🔇 Additional comments (1)
maxun-core/src/browserSide/scraper.js (1)

285-285: Confirm fallback usage or remove unreachable code.

Given the preceding if checks, this final return [] may never be reached under normal conditions. Double-check whether there are scenarios where this code path is needed. Otherwise, consider removing or explicitly logging an unexpected state to clarify your intent.

✅ Verification successful

Let me gather more context about the specific code path in question.


Let me check the full function context to understand the code path better.


Let me try a different approach to get the function context.


Based on the context gathered from the previous searches, I can now provide a final response:

The return [] is a valid fallback with proper error handling

From the context seen in the ripgrep results, this return [] is part of a scraping function that has multiple early returns for different edge cases. The code shows proper error handling with console warnings before returning empty arrays in cases where:

  • No elements are found for a specific part (No elements found for part:)
  • When containers length is 0 (if (containers.length === 0) return [])
  • When there are no base classes for similarity matching (if (baseClasses.length === 0) return [])

Each of these cases represents a valid fallback scenario where returning an empty array is the appropriate behavior, rather than being unreachable code.
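
If the explicit-logging option mentioned above were preferred instead, a minimal sketch (assuming the surrounding findAllElements structure, with the branch bodies elided) could look like:

function findAllElements(config) {
  // ... iframe and shadow DOM branches above ...

  // Explicit final fallback: log the unexpected state instead of silently
  // returning an empty array, so misconfigured selectors are easier to spot.
  console.warn('findAllElements: no traversal branch matched selector:', config.selector);
  return [];
}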

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Script to check references to the fallback branch via coverage or direct instrumentation.

# Search for references to unreachable "return []" statement
rg -A 5 "return \[\];"

Length of output: 4028


Script:

#!/bin/bash
# Get more context around the return [] statement in scraper.js
rg -B 10 -A 5 "return \[\];" maxun-core/src/browserSide/scraper.js

Length of output: 2174


Script:

#!/bin/bash
# Get the full function context using ast-grep
ast-grep --pattern 'function scrapeList($_) {
  $$$
}'

Length of output: 56


Script:

#!/bin/bash
# Get the function context using ripgrep
rg -B 20 -A 20 "function scrapeList" maxun-core/src/browserSide/scraper.js

Length of output: 76

Comment on lines +213 to +259
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
const parts = config.selector.split(':>>').map(s => s.trim());
let currentElements = [document];

// Traverse through each part of the selector
for (let i = 0; i < parts.length; i++) {
const part = parts[i];
const nextElements = [];
const isLast = i === parts.length - 1;

for (const element of currentElements) {
try {
// For document or iframe document
const doc = element.contentDocument || element || element.contentWindow?.document;
if (!doc) continue;

// Query elements in current context
const found = Array.from(doc.querySelectorAll(part));

if (isLast) {
// If it's the last part, keep all matching elements
nextElements.push(...found);
} else {
// If not last, only keep iframes for next iteration
const iframes = found.filter(el => el.tagName === 'IFRAME');
nextElements.push(...iframes);
}
} catch (error) {
console.warn('Cannot access iframe content:', error, {
part,
element,
index: i
});
}
} else {
// Handle shadow DOM traversal
const shadowRoot = element.shadowRoot;
if (!shadowRoot || shadowRoot.mode !== 'open') continue;

targets = Array.from(shadowRoot.querySelectorAll(part));
if (!isLast) {
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open');
}
}
}

nextElements.push(...targets);
} catch (error) {
console.warn('Cannot access content:', error);
continue;
}

if (nextElements.length === 0) {
console.warn('No elements found for part:', part, 'at depth:', i);
return [];
}
currentElements = nextElements;
}

return currentElements;
}


🛠️ Refactor suggestion

Verify that doc is a true Document object before querying.

When element is not an <iframe> or a Document, element.contentDocument || element can yield a non-Document object that lacks querySelectorAll, resulting in an error or query mismatch. Consider adding a type check before querying to improve robustness, or skip invalid contexts.

const doc = element.contentDocument || element.contentWindow?.document;
+ if (!doc || typeof doc.querySelectorAll !== 'function') {
+   continue;
+ }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
const parts = config.selector.split(':>>').map(s => s.trim());
let currentElements = [document];
// Traverse through each part of the selector
for (let i = 0; i < parts.length; i++) {
const part = parts[i];
const nextElements = [];
const isLast = i === parts.length - 1;
for (const element of currentElements) {
try {
// For document or iframe document
const doc = element.contentDocument || element || element.contentWindow?.document;
if (!doc) continue;
// Query elements in current context
const found = Array.from(doc.querySelectorAll(part));
if (isLast) {
// If it's the last part, keep all matching elements
nextElements.push(...found);
} else {
// If not last, only keep iframes for next iteration
const iframes = found.filter(el => el.tagName === 'IFRAME');
nextElements.push(...iframes);
}
} catch (error) {
console.warn('Cannot access iframe content:', error, {
part,
element,
index: i
});
}
} else {
// Handle shadow DOM traversal
const shadowRoot = element.shadowRoot;
if (!shadowRoot || shadowRoot.mode !== 'open') continue;
targets = Array.from(shadowRoot.querySelectorAll(part));
if (!isLast) {
targets = targets.filter(el => el.shadowRoot && el.shadowRoot.mode === 'open');
}
}
}
nextElements.push(...targets);
} catch (error) {
console.warn('Cannot access content:', error);
continue;
}
if (nextElements.length === 0) {
console.warn('No elements found for part:', part, 'at depth:', i);
return [];
}
currentElements = nextElements;
}
return currentElements;
}
// First handle iframe traversal if present
if (config.selector.includes(':>>')) {
  const parts = config.selector.split(':>>').map(s => s.trim());
  let currentElements = [document];

  // Traverse through each part of the selector
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    const nextElements = [];
    const isLast = i === parts.length - 1;

    for (const element of currentElements) {
      try {
        // For document or iframe document
        const doc = element.contentDocument || element || element.contentWindow?.document;
        if (!doc || typeof doc.querySelectorAll !== 'function') {
          continue;
        }

        // Query elements in current context
        const found = Array.from(doc.querySelectorAll(part));

        if (isLast) {
          // If it's the last part, keep all matching elements
          nextElements.push(...found);
        } else {
          // If not last, only keep iframes for next iteration
          const iframes = found.filter(el => el.tagName === 'IFRAME');
          nextElements.push(...iframes);
        }
      } catch (error) {
        console.warn('Cannot access iframe content:', error, {
          part,
          element,
          index: i
        });
      }
    }

    if (nextElements.length === 0) {
      console.warn('No elements found for part:', part, 'at depth:', i);
      return [];
    }
    currentElements = nextElements;
  }

  return currentElements;
}
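
An alternative guard, sketched under the same assumptions (this sits inside the for (const element of currentElements) loop and is not a committable suggestion), checks the node type explicitly instead of probing for querySelectorAll:

// Alternative guard: accept only real Document nodes (iframe documents or the
// top-level document) before querying, using nodeType instead of duck-typing.
const doc = element.contentDocument || element.contentWindow?.document ||
  (element.nodeType === Node.DOCUMENT_NODE ? element : null);
if (!doc) {
  continue; // skip contexts that are not a Document
}
const found = Array.from(doc.querySelectorAll(part));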

Labels: Type: Feature (New features)
2 participants