How to distinguish tables and figures #28244

Closed
dunalduck0 opened this issue Jan 9, 2023 · 9 comments
Assignees
Labels
Cognitive - Form Recognizer needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue.

Comments

@dunalduck0

dunalduck0 commented Jan 9, 2023

I am using prebuilt-layout to extract tables from PDF papers. In this paper (example link), the model mistook Fig. 3 on page 5 for a table (a snapshot of the figure is attached at the end).

My question is two-fold:

  1. Is there a built-in way to recognize figures and therefore filter them out?
  2. If the answer is no, I would like to use the surrounding text (e.g. "Fig 3" or "Table 2") to tell tables from figures. For that I need to understand the data field bounding_regions.polygon. It has four X-Y points. What do these four points represent for table/figure objects, and what is the unit?

image

@github-actions github-actions bot added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jan 9, 2023
@xiangyan99
Member

Thanks for reaching out.

Could you tell us which library and version you are using?

@xiangyan99 xiangyan99 added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Jan 10, 2023
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jan 10, 2023
@dunalduck0
Author

Hi @xiangyan99, I am using azure-ai-formrecognizer==3.2.0

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Jan 10, 2023
@catalinaperalta
Member

Thanks for the questions @dunalduck0! There isn't currently an option to enable/disable specifically recognizing figures with prebuilt-layout. Tagging @vkurpad from the service side to provide more insight here.

As for your second question, you can use the properties on the bounding region to correlate other recognized content that falls in the area you wish to search. The points of the polygon outline the specific component. For instance, the points of the bounding region on a table outline the recognized table in the document. The unit depends on whether the input is an image or a PDF: pixels for images and inches for PDFs. Here is the definition of the polygon on the bounding region:

        A list of points representing the bounding polygon
        that outlines the document component. The points are listed in
        clockwise order relative to the document component orientation
        starting from the top-left.
        Units are in pixels for images and inches for PDF.
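
To make the definition above concrete, here is a minimal sketch that reduces a four-point polygon to an axis-aligned box and measures how much two boxes overlap horizontally. It works on plain `(x, y)` tuples rather than SDK objects; with the SDK you would first convert each point of a region's polygon to such a tuple. The helper names `bounding_box` and `horizontal_overlap` and the sample coordinates are hypothetical, and the sketch assumes the origin is at the top-left of the page with y increasing downward.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y); inches for PDFs, pixels for images
Box = Tuple[float, float, float, float]   # (left, top, right, bottom)


def bounding_box(polygon: List[Point]) -> Box:
    """Collapse a polygon into the smallest axis-aligned box that contains it."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width shared by two boxes along the x axis (0.0 if they do not overlap)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


# Hypothetical polygons, listed clockwise from the top-left corner.
table_polygon = [(1.0, 2.0), (7.5, 2.0), (7.5, 5.0), (1.0, 5.0)]
caption_polygon = [(1.0, 5.2), (6.0, 5.2), (6.0, 5.4), (1.0, 5.4)]

table_box = bounding_box(table_polygon)
caption_box = bounding_box(caption_polygon)
print(table_box, horizontal_overlap(table_box, caption_box))
```

Taking the min/max of the points rather than relying on their order keeps the helper usable for slightly rotated components, where the polygon is not an exact rectangle.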

@catalinaperalta catalinaperalta added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Jan 17, 2023
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Jan 24, 2023
@ghost

ghost commented Jan 24, 2023

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@dunalduck0
Author

Thank you @catalinaperalta for the answer. I was able to eliminate figures by checking whether the nearest text (either above or below the detected table) starts with "Figure" or "Fig". I hope this will work for most well-written papers.
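
A rough sketch of that caption heuristic, for readers who want to try it. This is not the exact code used here; it assumes the table and the surrounding text blocks have already been reduced to `(left, top, right, bottom)` boxes on the same page, with y increasing downward. `TextBlock`, `nearest_text`, `looks_like_figure`, and the sample data are all hypothetical names and values.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


class TextBlock(NamedTuple):
    content: str
    page: int
    box: Box


# Matches captions such as "Figure 3", "Fig. 3" or "fig 3" at the start of a block.
FIGURE_CAPTION = re.compile(r"^\s*(figure|fig)\b", re.IGNORECASE)


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0.0 if they overlap vertically."""
    if b[3] <= a[1]:      # b lies entirely above a
        return a[1] - b[3]
    if b[1] >= a[3]:      # b lies entirely below a
        return b[1] - a[3]
    return 0.0


def nearest_text(table_box: Box, page: int, blocks: List[TextBlock]) -> Optional[TextBlock]:
    """Closest text block strictly above or below the table on the same page."""
    candidates = [
        b for b in blocks
        if b.page == page and (b.box[3] <= table_box[1] or b.box[1] >= table_box[3])
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: vertical_gap(table_box, b.box))


def looks_like_figure(table_box: Box, page: int, blocks: List[TextBlock]) -> bool:
    """True if the nearest text above or below the table reads like a figure caption."""
    nearest = nearest_text(table_box, page, blocks)
    return nearest is not None and FIGURE_CAPTION.match(nearest.content) is not None


blocks = [
    TextBlock("Some body text above the figure.", 5, (1.0, 0.5, 7.5, 1.5)),
    TextBlock("Fig. 3: Accuracy over epochs", 5, (1.0, 5.2, 6.0, 5.4)),
]
print(looks_like_figure((1.0, 2.0, 7.5, 5.0), 5, blocks))
```

The check only looks at vertical distance, so it can be fooled by two-column layouts where the nearest block belongs to the other column; adding a horizontal-overlap condition would be a natural refinement.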

I have 3 additional questions about table extraction quality.

  1. Many tables in the papers I am interested in contain both column and row headers. But the package seems to recognize only column headers, not row headers: I found only two kinds of cells in the extraction, columnHeader and content. There is no rowHeader kind.
  2. The row/column headers are often nested, and the extraction struggles to understand them. An example is attached below. In the output (the prefixes columnHeader_ and content_ were added artificially), the nested column headers are sometimes merged into a single column header (columns 2, 3, 4, 5) and sometimes recognized correctly (columns 6, 7, 8, 9). Can anything be done to improve this?

Original table in paper
image

Form-recognizer extraction
image

  3. Special symbols, superscripts, and subscripts are lost in the extracted data (see the same example above). Excel was able to preserve this information (see below), though it still struggled with nested headers.

Excel extraction
image
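
For reference, the prefixed grid shown in the extraction above can be rebuilt roughly as follows. This is a sketch under assumptions: each cell is represented by a hypothetical `Cell` tuple carrying a row index, column index, content, kind, and row/column spans (mirroring the fields the layout result reports for table cells), and the sample cells are made up rather than taken from the paper.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    row_index: int
    column_index: int
    content: str
    kind: str = "content"   # e.g. "columnHeader", "rowHeader", "content"
    row_span: int = 1
    column_span: int = 1


def to_grid(cells: List[Cell], row_count: int, column_count: int) -> List[List[str]]:
    """Lay cells out on a grid, prefixing each value with its kind.

    A cell that spans several rows or columns is repeated in every position
    it covers, which makes merged or nested headers easy to spot.
    """
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        label = f"{cell.kind}_{cell.content}"
        for r in range(cell.row_index, cell.row_index + cell.row_span):
            for c in range(cell.column_index, cell.column_index + cell.column_span):
                grid[r][c] = label
    return grid


# Hypothetical table with a nested column header ("Group A" over "x" and "y").
cells = [
    Cell(0, 0, "Analog", "columnHeader", row_span=2),
    Cell(0, 1, "Group A", "columnHeader", column_span=2),
    Cell(1, 1, "x", "columnHeader"),
    Cell(1, 2, "y", "columnHeader"),
    Cell(2, 0, "1a"),
    Cell(2, 1, "0.5"),
    Cell(2, 2, "0.7"),
]
grid = to_grid(cells, row_count=3, column_count=3)
for row in grid:
    print(row)
```

Printing the kind next to each value makes it obvious when no cell is tagged rowHeader, or when a parent header was merged with its children into one cell.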

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue. labels Jan 24, 2023
@catalinaperalta
Member

Glad to help @dunalduck0! These are good questions. It seems that the prebuilt-layout algorithm is not recognizing all of the elements you're looking for in this set of documents. A custom model might improve recognition for your specific set of documents.
@vkurpad should prebuilt-layout have the ability to return some of these content elements (such as the rowHeader in addition to columnHeader, nested headers, and the special symbols)?

@bojunehsu
Member

Hi @dunalduck0,

We are constantly improving our underlying table extraction algorithm. I was able to get the correct nested column headers via https://formrecognizer.appliedai.azure.com/studio/layout (except for two pieces of header text that were missed). Can you try again?
image

We do return rowHeader as a cell type in certain cases. But in this particular table, with no visual indication, it is subjective whether the Analog column is a rowHeader. I personally would not label it as such.

The service does not yet support the recognition of super/subscripts, or mathematical formulas in general.

@vkurpad
Member

vkurpad commented Jan 31, 2023

I tried the same image in the Studio and got the same result shared by @bojunehsu. Could you try updating to the latest SDK version?

There are a few planned updates that should improve the issues with mathematical formulas.

@catalinaperalta catalinaperalta added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Jan 31, 2023
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Feb 8, 2023
@ghost

ghost commented Feb 8, 2023

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@ghost ghost closed this as completed Feb 23, 2023
@github-actions github-actions bot locked and limited conversation to collaborators May 24, 2023
This issue was closed.