Issue with partition_pdf #2316
Comments
Hi @pranavbhat12,
Thank you for reaching out! I just ran !pip install "unstructured[all-docs]", and the PDF I used is attached. I also tried downgrading to version 0.11.5.
This issue occurs mainly when strategy is set to "hi_res". According to the error, the problem extends into the unstructured_inference library's TableTransformers code.
Issue seems to be stemming from
It seems that getting the text_as_html metadata is failing. For now, wrapping that call in a try/except that returns a blank string on error works as a temporary fix, but a permanent fix would be advisable.
The actual error seems to happen in unstructured_inference > models > tables.py, line 190.
The recognize method seems to receive an empty array, I suppose because no tokens are derived. Help fixing this would be appreciated. Thanks.
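A minimal sketch of the temporary try/except workaround described above, not the library's actual code. `safe_table_html` is a hypothetical helper, and `predict` stands in for whatever callable produces the table HTML (in the traceback further down, `tables_agent.predict`). It returns an empty string when the prediction raises.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_table_html(predict: Callable[..., str], image: Any, ocr_tokens: Any = None) -> str:
    """Return the predicted table HTML, or "" if the prediction fails.

    `predict` is a stand-in for a table predictor such as `tables_agent.predict`;
    this wrapper is hypothetical and not part of unstructured.
    """
    try:
        return predict(image, ocr_tokens=ocr_tokens)
    except Exception as exc:  # broad on purpose: this is a stopgap, not a fix
        logger.warning("Table structure prediction failed, using empty HTML: %s", exc)
        return ""
```

Swallowing the exception hides the root cause, so this only makes sense as a stopgap until the underlying bug is fixed.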
Just a correction: this issue happens when setting infer_table_structure = True.
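Since the failure is tied to infer_table_structure = True, one possible stopgap is to retry without table inference when the first attempt raises. The sketch below assumes nothing about unstructured's internals: `partition_with_table_fallback` is a hypothetical helper, and `partition` is any callable that accepts an `infer_table_structure` keyword, such as `partition_pdf`.

```python
from typing import Any, Callable, List


def partition_with_table_fallback(partition: Callable[..., List[Any]], **kwargs: Any) -> List[Any]:
    """Call `partition` with table inference on; retry without it on failure."""
    try:
        return partition(**{**kwargs, "infer_table_structure": True})
    except Exception:
        # Stopgap: drop table structure inference so the rest of the document still parses.
        return partition(**{**kwargs, "infer_table_structure": False})
```

Usage would look like `partition_with_table_fallback(partition_pdf, filename="docs/sample.pdf", strategy="hi_res")`. This is likely to help only with Python-level errors; after a CUDA illegal memory access the CUDA context in the same process is usually unusable, so the retry may fail as well.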
After installing tesseract, it shows an AttributeError. If anyone knows how to solve this problem, please reply ASAP.
Hi @pranavbhat12 @HardKothari @DeepKariaX @Aarsh01 This issue appears to be related to #3119 and should be resolved by the changes implemented in PR #3130.
@christinestraub I have updated the version and am now getting: ValueError: max() arg is an empty sequence, raised at unstructured_inference/models/tables.py, line 667, in fill_cells. Unfortunately, I cannot share the PDF. When I keep the infer_table_structure = True parameter I get this error; after removing the parameter it works perfectly.
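For reference, Python's built-in `max()` raises exactly this ValueError on an empty iterable unless a `default` is supplied. The snippet below only illustrates that behaviour with a hypothetical `last_row_index` helper and made-up cell dictionaries; it is not the code in `fill_cells`.

```python
from typing import Dict, List


def last_row_index(cells: List[Dict[str, int]]) -> int:
    """Return the largest row index among cells, or -1 when there are no cells."""
    # max() on an empty sequence raises ValueError; `default` avoids that.
    return max((cell["row"] for cell in cells), default=-1)
```

This is consistent with the earlier guess in this thread that no tokens or cells were derived for the table, although the report does not confirm it.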
Similar to #3252, closing this since it's assumed to be resolved, but feel free to reopen if you're still having this issue.
While trying to read a PDF file with the partition_pdf function, I am getting this error:
RuntimeError Traceback (most recent call last)
Cell In[11], line 7
4 from unstructured.partition.pdf import partition_pdf
6 # Get elements
----> 7 raw_pdf_elements = partition_pdf(
8 filename="docs/sample.pdf",
9 # Unstructured first finds embedded image blocks
10 extract_images_in_pdf=False,
11 strategy="hi_res",
12 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
13 # Titles are any sub-section of the document
14 infer_table_structure=True,
15 # Post processing to aggregate text once we have the title
16 chunking_strategy="by_title",
17 # Chunking params to aggregate text blocks
18 # Attempt to create a new chunk 3800 chars
19 # Attempt to keep chunks > 2000 chars
20 max_characters=1000,
21 new_after_n_chars=500,
22 combine_text_under_n_chars=200,
23 image_output_dir_path="docs/",
24
25 )
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/documents/elements.py:514, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
512 @functools.wraps(func)
513 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 514 elements = func(*args, **kwargs)
515 sig = inspect.signature(func)
516 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/file_utils/filetype.py:591, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
589 @functools.wraps(func)
590 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 591 elements = func(*args, **kwargs)
592 sig = inspect.signature(func)
593 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/file_utils/filetype.py:546, in add_metadata.<locals>.wrapper(*args, **kwargs)
544 @functools.wraps(func)
545 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 546 elements = func(*args, **kwargs)
547 sig = inspect.signature(func)
548 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/chunking/__init__.py:52, in add_chunking_strategy.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
50 @functools.wraps(func)
51 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
---> 52 elements = func(*args, **kwargs)
53 sig = inspect.signature(func)
54 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:191, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, extract_images_in_pdf, extract_element_types, image_output_dir_path, **kwargs)
187 exactly_one(filename=filename, file=file)
189 languages = check_languages(languages, ocr_languages)
--> 191 return partition_pdf_or_image(
192 filename=filename,
193 file=file,
194 include_page_breaks=include_page_breaks,
195 strategy=strategy,
196 infer_table_structure=infer_table_structure,
197 languages=languages,
198 metadata_last_modified=metadata_last_modified,
199 extract_images_in_pdf=extract_images_in_pdf,
200 extract_element_types=extract_element_types,
201 image_output_dir_path=image_output_dir_path,
202 **kwargs,
203 )
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:505, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, extract_images_in_pdf, extract_element_types, image_output_dir_path, **kwargs)
503 with warnings.catch_warnings():
504 warnings.simplefilter("ignore")
--> 505 elements = _partition_pdf_or_image_local(
506 filename=filename,
507 file=spooled_to_bytes_io_if_needed(file),
508 is_image=is_image,
509 infer_table_structure=infer_table_structure,
510 include_page_breaks=include_page_breaks,
511 languages=languages,
512 metadata_last_modified=metadata_last_modified or last_modification_date,
513 pdf_text_extractable=pdf_text_extractable,
514 extract_images_in_pdf=extract_images_in_pdf,
515 extract_element_types=extract_element_types,
516 image_output_dir_path=image_output_dir_path,
517 **kwargs,
518 )
519 out_elements = _process_uncategorized_text_elements(elements)
521 elif strategy == PartitionStrategy.FAST:
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/utils.py:214, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
205 if len(missing_deps) > 0:
206 raise ImportError(
207 f"Following dependencies are missing: {', '.join(missing_deps)}. "
208 + (
(...)
212 ),
213 )
--> 214 return func(*args, **kwargs)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:321, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, metadata_last_modified, pdf_text_extractable, extract_images_in_pdf, extract_element_types, image_output_dir_path, pdf_image_dpi, analysis, analyzed_image_output_dir_path, **kwargs)
319 final_document_layout = merged_document_layout
320 else:
--> 321 final_document_layout = process_file_with_ocr(
322 filename,
323 merged_document_layout,
324 is_image=is_image,
325 infer_table_structure=infer_table_structure,
326 ocr_languages=ocr_languages,
327 ocr_mode=ocr_mode,
328 pdf_image_dpi=pdf_image_dpi,
329 )
330 else:
331 inferred_document_layout = process_data_with_model(
332 file,
333 is_image=is_image,
334 model_name=model_name,
335 pdf_image_dpi=pdf_image_dpi,
336 )
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:171, in process_file_with_ocr(filename, out_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
169 except Exception as e:
170 if os.path.isdir(filename) or os.path.isfile(filename):
--> 171 raise e
172 else:
173 raise FileNotFoundError(f'File "{filename}" not found!') from e
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:160, in process_file_with_ocr(filename, out_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
158 for i, image_path in enumerate(image_paths):
159 with PILImage.open(image_path) as image:
--> 160 merged_page_layout = supplement_page_layout_with_ocr(
161 out_layout.pages[i],
162 image,
163 infer_table_structure=infer_table_structure,
164 ocr_languages=ocr_languages,
165 ocr_mode=ocr_mode,
166 )
167 merged_page_layouts.append(merged_page_layout)
168 return DocumentLayout.from_pages(merged_page_layouts)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:237, in supplement_page_layout_with_ocr(page_layout, image, infer_table_structure, ocr_languages, ocr_mode)
234 if tables.tables_agent is None:
235 raise RuntimeError("Unable to load table extraction agent.")
--> 237 page_layout.elements[:] = supplement_element_with_table_extraction(
238 elements=cast(List[LayoutElement], page_layout.elements),
239 image=image,
240 tables_agent=tables.tables_agent,
241 ocr_languages=ocr_languages,
242 ocr_agent=ocr_agent,
243 )
245 return page_layout
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:274, in supplement_element_with_table_extraction(elements, image, tables_agent, ocr_languages, ocr_agent)
263 cropped_image = image.crop(
264 (
265 padded_element.bbox.x1,
(...)
269 ),
270 )
271 table_tokens = get_table_tokens(
272 image=cropped_image, ocr_languages=ocr_languages, ocr_agent=ocr_agent
273 )
--> 274 element.text_as_html = tables_agent.predict(cropped_image, ocr_tokens=table_tokens)
275 return elements
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:53, in UnstructuredTableTransformerModel.predict(self, x, ocr_tokens)
37 """Predict table structure deferring to run_prediction with ocr tokens
38
39 Note:
(...)
50 FIXME: refactor token data into a dataclass so we have clear expectations of the fields
51 """
52 super().predict(x)
---> 53 return self.run_prediction(x, ocr_tokens=ocr_tokens)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:182, in UnstructuredTableTransformerModel.run_prediction(self, x, pad_for_structure_detection, ocr_tokens, result_format)
174 def run_prediction(
175 self,
176 x: Image,
(...)
179 result_format: Optional[str] = "html",
180 ):
181 """Predict table structure"""
--> 182 outputs_structure = self.get_structure(x, pad_for_structure_detection)
183 if ocr_tokens is None:
184 logger.warning(
185 "Table OCR from get_tokens method will be deprecated. "
186 "In the future the OCR tokens are expected to be passed in.",
187 )
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:169, in UnstructuredTableTransformerModel.get_structure(self, x, pad_for_structure_detection)
164 with torch.no_grad():
165 logger.info(f"padding image by {pad_for_structure_detection} for structure detection")
166 encoding = self.feature_extractor(
167 pad_image_with_background_color(x, pad_for_structure_detection),
168 return_tensors="pt",
--> 169 ).to(self.device)
170 outputs_structure = self.model(**encoding)
171 outputs_structure["pad_for_structure_detection"] = pad_for_structure_detection
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/transformers/feature_extraction_utils.py:231, in BatchFeature.to(self, *args, **kwargs)
227 for k, v in self.items():
228 # check if v is a floating point
229 if torch.is_floating_point(v):
230 # cast and send to device
--> 231 new_data[k] = v.to(*args, **kwargs)
232 elif device is not None:
233 new_data[k] = v.to(device=device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Code:
from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename="docs/sample.pdf",
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    strategy="hi_res",
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard maximum chunk size, in characters
    max_characters=1000,
    # Prefer to start a new chunk after this many characters
    new_after_n_chars=500,
    # Combine text blocks smaller than this many characters
    combine_text_under_n_chars=200,
    image_output_dir_path="docs/",
)
How can we solve this error?
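The error message itself suggests passing CUDA_LAUNCH_BLOCKING=1 to get a more reliable stack trace. A minimal sketch of doing that from Python follows; `configure_cuda_debug_env` is a hypothetical helper, and it has to run before torch (or unstructured) is imported. The optional `force_cpu` flag hides the GPUs via CUDA_VISIBLE_DEVICES, which makes `torch.cuda.is_available()` return False. Whether the table model then runs on CPU depends on how unstructured_inference selects its device, which is an assumption here.

```python
import os


def configure_cuda_debug_env(force_cpu: bool = False) -> None:
    """Set CUDA-related environment variables; call before importing torch."""
    # Make kernel launches synchronous so the traceback points at the failing call.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # Hide all GPUs so torch.cuda.is_available() returns False.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""


configure_cuda_debug_env(force_cpu=False)
# Import torch / unstructured only after this point.
```

In a notebook, the kernel usually needs a restart first, because these variables are read when CUDA is initialized.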