How to distinguish tables and figures #28244

dunalduck0 · 2023-01-09T23:47:42Z

I am using prebuilt-layout to extract tables from PDF papers. In this example paper (link), the model mistook Fig 3 on page 5 for a table (a snapshot of the figure is attached at the end).

My question is two-fold:

1. Is there a built-in way to recognize figures so that they can be filtered out?
2. If the answer is no, I would like to use the surrounding text (e.g. "Fig 3" or "Table 2") to tell tables from figures. For that I need to understand the data field bounding_regions.polygon. It has four X-Y points. What do these four points represent in table/figure objects, and what is the unit?

xiangyan99 · 2023-01-10T17:40:58Z

Thanks for reaching out.

Could you tell us which library and version you are using?

dunalduck0 · 2023-01-10T19:15:00Z

Hi @xiangyan99, I am using azure-ai-formrecognizer==3.2.0

catalinaperalta · 2023-01-17T18:15:22Z

Thanks for the questions @dunalduck0! There isn't currently an option to enable/disable specifically recognizing figures with prebuilt-layout. Tagging @vkurpad from the service side to provide more insight here.

As for your second question, you can use the properties on the bounding region to correlate other recognized content that falls in the area you wish to search. The points of the polygon outline the specific component. For instance, the points of the bounding region on a table outline the recognized table in the document. The unit depends on whether the input is an image or a PDF: for images the unit is pixels, and for PDFs it is inches. Here is the definition of the polygon on the bounding region:

        A list of points representing the bounding polygon
        that outlines the document component. The points are listed in
        clockwise order relative to the document component orientation
        starting from the top-left.
        Units are in pixels for images and inches for PDF.
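
As a minimal sketch of how the polygon described above could be used, the snippet below reduces the points to an axis-aligned bounding box. `Point` is a stand-in for the SDK's point objects (assumed to expose `x` and `y`), and `polygon_to_bbox` is a hypothetical helper name, not part of the library. It also assumes the page origin is at the top-left, so y grows downward.

```python
from typing import Iterable, NamedTuple, Tuple


class Point(NamedTuple):
    """Stand-in for an SDK polygon point; assumed to expose x and y."""
    x: float
    y: float


def polygon_to_bbox(polygon: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for a polygon.

    Units are whatever the service returned: pixels for images, inches for PDFs.
    Assumes y grows downward, so "top" is the smallest y.
    """
    points = list(polygon)
    if not points:
        raise ValueError("polygon has no points")
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up example: a table outlined clockwise from the top-left, in inches.
table_polygon = [Point(1.0, 2.0), Point(7.5, 2.0), Point(7.5, 4.25), Point(1.0, 4.25)]
bbox = polygon_to_bbox(table_polygon)
print(bbox)
```

Using min/max over all points, rather than picking the first and third, keeps the box valid even if the component is slightly rotated.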

ghost · 2023-01-24T20:03:11Z

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

dunalduck0 · 2023-01-24T22:30:11Z

Thank you @catalinaperalta for the answer. I was able to eliminate figures by checking whether the nearest text (either above or below) starts with "Figure" or "Fig". I hope it will work for most well-written papers.
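
The caption heuristic described above might look roughly like the following. This is only a sketch under assumptions: `TextBlock`, `vertical_gap`, `nearest_block`, and `looks_like_figure` are hypothetical names, boxes are `(left, top, right, bottom)` with y growing downward, and all blocks are assumed to be on the same page as the table. Horizontal overlap is ignored, which may matter for two-column papers.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom), y grows downward

# Matches text starting with "Fig" or "Figure" as a whole word, case-insensitively.
FIGURE_CAPTION = re.compile(r"^\s*fig(ure)?\b", re.IGNORECASE)


@dataclass
class TextBlock:
    """Hypothetical container for a paragraph's text and its bounding box."""
    text: str
    bbox: BBox


def vertical_gap(table: BBox, block: BBox) -> Optional[float]:
    """Distance to a block lying fully above or below the table, else None."""
    if block[3] <= table[1]:   # block ends above the table's top edge
        return table[1] - block[3]
    if block[1] >= table[3]:   # block starts below the table's bottom edge
        return block[1] - table[3]
    return None                # vertical overlap, e.g. text inside the table


def nearest_block(table: BBox, blocks: List[TextBlock]) -> Optional[TextBlock]:
    """Return the closest block above or below the table, or None."""
    best: Optional[TextBlock] = None
    best_gap = float("inf")
    for block in blocks:
        gap = vertical_gap(table, block.bbox)
        if gap is not None and gap < best_gap:
            best, best_gap = block, gap
    return best


def looks_like_figure(table: BBox, blocks: List[TextBlock]) -> bool:
    """True if the nearest text above or below starts with "Fig" or "Figure"."""
    block = nearest_block(table, blocks)
    return block is not None and bool(FIGURE_CAPTION.match(block.text))


# Made-up layout: a heading above and a figure caption just below the region.
region = (1.0, 2.0, 7.5, 4.25)
blocks = [
    TextBlock("Table 2: Results", (1.0, 0.5, 7.5, 0.8)),
    TextBlock("Fig 3. Loss curves", (1.0, 4.4, 7.5, 4.6)),
]
print(looks_like_figure(region, blocks))
```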

I have 3 additional questions about table extraction quality.

1. Many tables in the papers I am interested in contain both column and row headers. But the package seems to recognize only column headers, not row headers: I found only two kinds of cell in the extraction, columnHeader and content. There is no rowHeader kind.
2. The row/column headers are often nested, and the extraction struggles to understand them. An example is attached below. In the output (the prefixes columnHeader_ and content_ were added artificially), the nested column headers are sometimes merged into a single column header (columns 2, 3, 4, 5) and sometimes recognized correctly (columns 6, 7, 8, 9). Can anything be done to improve this?

Original table in paper

Form-recognizer extraction

3. Special symbols, superscripts, and subscripts are lost in the extracted data (see the same example above). Excel was able to preserve this information (see below), though it also struggled with nested headers.

Excel extraction

catalinaperalta · 2023-01-26T02:23:36Z

Glad to help @dunalduck0! These are good questions. It seems that prebuilt-layout is not recognizing all of the elements you're looking for in this set of documents. A custom model might improve recognition for your specific set of documents.
@vkurpad, should prebuilt-layout be able to return some of these content elements (such as rowHeader in addition to columnHeader, nested headers, and the special symbols)?

bojunehsu · 2023-01-31T07:41:38Z

Hi @dunalduck0,

We are constantly improving our underlying table extraction algorithm. I was able to get the correct nested column headers via https://formrecognizer.appliedai.azure.com/studio/layout (except for 2 missed header texts). Can you try again?

We do return rowHeader as a cell type in certain cases. But in this particular table, with no visual indication, it is subjective whether the Analog column is a rowHeader. I personally would not label it as such.

The service does not yet support the recognition of super/subscripts, or mathematical formulas in general.
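
To check which cell kinds were returned for a given table, a quick tally can help. The sketch below assumes each cell exposes a `kind` string such as columnHeader, rowHeader, or content, as discussed in this thread. `Cell` and `count_cell_kinds` are hypothetical names used for illustration, and the sample cells are made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Cell:
    """Stand-in for a table cell; only the fields used here are modelled."""
    kind: str
    content: str


def count_cell_kinds(cells: Iterable[Cell]) -> Counter:
    """Count how many cells of each kind a table contains."""
    return Counter(cell.kind for cell in cells)


# Made-up sample: two header cells and two body cells, no row headers.
cells = [
    Cell("columnHeader", "Analog"),
    Cell("columnHeader", "Value"),
    Cell("content", "1a"),
    Cell("content", "0.5"),
]
kinds = count_cell_kinds(cells)
print(kinds)
print("rowHeader" in kinds)
```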

vkurpad · 2023-01-31T08:19:12Z

I tried the same image in the Studio and got the same result shared by @bojunehsu. Could you try updating to the latest SDK version?

There are a few planned updates that should help with the mathematical formula issues.

ghost · 2023-02-08T02:03:05Z

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost closed this as completed Feb 23, 2023

github-actions bot locked and limited conversation to collaborators May 24, 2023
