Unable to parse PDFs, "Failed to resolve 'test.elicit.org'" #327

Open
mathcass opened this issue May 27, 2024 · 1 comment

Comments

@mathcass

When I run the "Loading paper text" chapter from the Primer, I get an error saying that the host "test.elicit.org" cannot be resolved. Because paper.parse_pdf depends on this remote service to parse the PDF, the recipe cannot proceed at all.

Here's a full trace of what I see:

Full trace
python recipes/paper_hello.py --paper papers/keenan-2018.pdf
/home/cass/src/ice/venv/lib/python3.11/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.generics:GenericModel` has been moved to `pydantic.BaseModel`.
  warnings.warn(f'`{import_path}` has been moved to `{new_location}`.')
/home/cass/src/ice/venv/lib/python3.11/site-packages/pydantic/_internal/_config.py:334: UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types'
* 'fields' has been removed
  warnings.warn(message, UserWarning)
Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/socket.py", line 961, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 616, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x751bb09bb4d0>: Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/adapters.py", line 589, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='test.elicit.org', port=443): Max retries exceeded with url: /elicit-previews/james/oug-3083-support-parsing-arbitrary-pdfs-using/parse_pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x751bb09bb4d0>: Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cass/src/ice/recipes/paper_hello.py", line 10, in <module>
    recipe.main(answer_for_paper)
  File "/home/cass/src/ice/ice/recipe.py", line 176, in main
    defopt.run(
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 348, in run
    call = bind(
           ^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 255, in bind
    call, rest = _bind_or_bind_known(*args, _known=False, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 203, in _bind_or_bind_known
    args, rest = parser.parse_args(argv), []
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1862, in parse_args
    args, argv = self.parse_known_args(args, namespace)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1895, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2103, in _parse_known_args
    start_index = consume_optional(start_index)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2043, in consume_optional
    take_action(action, args, option_string)
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1955, in take_action
    argument_values = self._get_values(action, argument_strings)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2485, in _get_values
    value = self._get_value(action, arg_string)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2518, in _get_value
    result = type_func(arg_string)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/recipe.py", line 181, in <lambda>
    Paper: lambda path: Paper.load(Path(path)),
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/paper.py", line 158, in load
    paragraph_dicts = parse_pdf(file)
                      ^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/cache.py", line 28, in sync_wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/paper.py", line 119, in parse_pdf
    r = requests.post(PDF_PARSER_URL, files=files)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/adapters.py", line 622, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='test.elicit.org', port=443): Max retries exceeded with url: /elicit-previews/james/oug-3083-support-parsing-arbitrary-pdfs-using/parse_pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x751bb09bb4d0>: Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)"))
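
The innermost error in the trace is `socket.gaierror`, a DNS lookup failure, so the request fails before any PDF data is sent. A minimal sketch for checking name resolution independently of ICE, using only the standard library (the helper name `can_resolve` is hypothetical and not part of ICE):

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        # Same lookup urllib3 performs before opening the connection.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Raised when the name cannot be resolved, as in the trace above.
        return False
    return True
```

Calling `can_resolve("test.elicit.org")` on the affected machine should return `False` if the lookup fails the same way it does in the trace.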

❓ Is there an alternative that folks recommend for PDF parsing here?

@TommyBark
Contributor

I have put a quick fix for this at https://github.com/TommyBark/ice/tree/fix-parse_pdf that uses the pdfminer.six package.
The semantic chunking is not very reliable, since it is based on HTML parsing and not all PDFs work well with it, but it works as a proof of concept for the Factored Cognition Primer examples.
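
For readers who only need local text extraction, here is a rough sketch of the general idea with pdfminer.six. It is not the code in the branch above: it uses plain-text extraction via `pdfminer.high_level.extract_text` and splits on blank lines instead of parsing HTML. The function names `split_paragraphs` and `extract_paragraphs` are made up for illustration, and the result is a plain list of strings, not the paragraph dicts that `Paper.load` expects.

```python
import re
from pathlib import Path


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text into paragraphs on blank lines."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Join hard-wrapped lines inside a block into a single line.
        joined = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if joined:
            paragraphs.append(joined)
    return paragraphs


def extract_paragraphs(pdf_path: Path) -> list[str]:
    """Extract text from a PDF locally and split it into paragraphs."""
    # Imported here so split_paragraphs works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_paragraphs(extract_text(str(pdf_path)))
```

Something like `extract_paragraphs(Path("papers/keenan-2018.pdf"))` would then run without any network access, though blank-line splitting is only a heuristic and can merge or break paragraphs depending on the PDF layout.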
