fix: [ISSUE] "Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'" in API workflow execution #595

kun432 · 2024-08-19T13:53:38Z

Describe the bug

I followed the "Getting Started" instructions but got an error in the API workflow. The same error occurs every time.

Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
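
For context, the `[Errno 2] No such file or directory: ...` part has the shape Python gives a `FileNotFoundError` (errno `ENOENT`). Assuming the tool reads its input with an ordinary Python `open`, which this issue does not confirm, the minimal sketch below reproduces the same message format. `describe_missing_file` is a hypothetical helper name used only for illustration.

```python
import errno
import os
import tempfile


def describe_missing_file(path: str) -> str:
    """Try to open `path` for reading; return "ok" or the error text if it is missing."""
    try:
        with open(path, "rb"):
            return "ok"
    except FileNotFoundError as exc:
        # FileNotFoundError is the OSError subclass raised for errno.ENOENT (2).
        assert exc.errno == errno.ENOENT
        # str(exc) is the text that typically ends up in logs.
        return str(exc)


if __name__ == "__main__":
    # Use a fresh temporary directory so the file is guaranteed not to exist.
    with tempfile.TemporaryDirectory() as tmp:
        print(describe_missing_file(os.path.join(tmp, "INFILE.pdf")))
```

The message only says that the path did not exist at the moment it was opened; it does not say whether the file was never written or was written somewhere else.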

To reproduce

I followed the "Getting Started" instructions, except that I:

used a Japanese invoice PDF
changed the prompts slightly to match invoice documents.

The error happens when I:

create an API workflow and send a request to the API with Postman (I checked the logs via the workflow)
use "Run Workflow" with some PDFs on the workflow screen.

I get the same error every time.

Expected behavior

The API workflow returns correctly parsed data.
"Run Workflow" works correctly and returns parsed data.

Environment details

Version: commit 830e563 (tag: v0.81.0)
OS: macOS 13.6
Docker: Docker Desktop for Mac 4.31.0
LLM: OpenAI gpt-4o-mini
Embeddings: OpenAI
VectorDB: Qdrant Cloud
Text Extractor: Llama Parse

Additional context

I have 3 sample PDFs.
Parsing all of them works perfectly in Prompt Studio.
On the other hand, parsing any of them always fails in both the API and the Workflow.

Screenshots

API request from Postman

and logs

"Run Workflow"

Prompt Studio works.

Deepak-Kesavan · 2024-08-19T15:09:27Z

Hi @kun432.

Thanks for trying out Unstract. The issue mentioned above was a regression in v0.81.0 that broke the API deployment; the fix went in with PR #592. Please try the latest version of Unstract (v0.81.1) and let us know whether all the issues you mentioned are resolved.

kun432 · 2024-08-19T15:55:57Z

@Deepak-Kesavan Thanks, but:

I still get the same error in Workflow execution.

The API now returns a different error from before.

The logs then seem to be truncated before the run finishes.

My update steps:

$ docker compose -f docker/docker-compose.yaml down
$ docker rmi $(docker images | grep "unstract/backend" | awk '{print $3}')
$ ./run-platform.sh -u

(snip)
Fetching release tags.
Performing git checkout to v0.81.1.
Performing git pull on v0.81.1.
(snip)

I will remove all Unstract images other than the DB and try again.

$ docker rmi $(docker images | grep "unstract/" | awk '{print $3}')

kun432 · 2024-08-19T16:07:20Z

Still unresolved.

Deepak-Kesavan · 2024-08-19T17:02:16Z

Thanks for the update @kun432 . We will investigate this further and get back.

Please include the logs from unstract-worker if possible.

kun432 · 2024-08-19T17:41:03Z

I removed everything, including all container images, volumes, and networks, and tried again, but no luck.
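
A possible way to narrow this down, sketched under assumptions: each `Docker config:` line in the worker log below looks like a printed Python dict that includes the bind mount behind `/data`. The snippet parses such a line and maps the container path from the error to the matching path on the Docker host, so one can check there whether the input file was ever written. The helper names (`parse_docker_config`, `bind_mounts`, `host_path_for`) are hypothetical and not part of Unstract, and the sample line in the usage part is made up.

```python
import ast
from pathlib import PurePosixPath

MARKER = "Docker config: "


def parse_docker_config(line: str):
    """Return the dict logged after 'Docker config: ', or None if the line has none."""
    _, found, tail = line.partition(MARKER)
    if not found:
        return None
    try:
        # The log prints a Python repr (True/False, single quotes), so
        # ast.literal_eval is used rather than json.loads.
        config = ast.literal_eval(tail.strip())
    except (ValueError, SyntaxError):
        return None
    return config if isinstance(config, dict) else None


def bind_mounts(config: dict):
    """List (host_source, container_target) pairs for bind mounts in the config."""
    return [
        (m["source"], m["target"])
        for m in config.get("mounts", [])
        if m.get("type") == "bind"
    ]


def host_path_for(container_path: str, mounts):
    """Map a container path to its host path via the mounts; None if no mount covers it."""
    for source, target in mounts:
        try:
            relative = PurePosixPath(container_path).relative_to(target)
        except ValueError:
            continue  # this mount does not contain the path
        return str(PurePosixPath(source) / relative)
    return None


if __name__ == "__main__":
    # Made-up sample line with the same shape as the worker log entries.
    sample = (
        "unstract-worker | INFO in docker: Docker config: "
        "{'name': 'demo', 'detach': True, 'mounts': [{'type': 'bind', "
        "'source': '/host/exec/1234', 'target': '/data'}], 'labels': []}"
    )
    config = parse_docker_config(sample)
    print(host_path_for("/data/INFILE.pdf", bind_mounts(config)))
```

Checking the returned path on the host (for example with `os.path.exists`) would then show whether the file is missing from the execution directory itself or is only invisible from inside the tool container.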

Here's unstract-worker's log.

$ docker compose -f docker/docker-compose.yaml logs | grep unstract-worker

unstract-worker            | [2024-08-19 16:21:15 +0000] [9] [DEBUG] Current configuration:
unstract-worker              |   config: ./gunicorn.conf.py
unstract-worker              |   wsgi_app: None
unstract-worker              |   bind: ['0.0.0.0:5002']
unstract-worker              |   backlog: 2048
unstract-worker              |   workers: 2
unstract-worker              |   worker_class: gevent
unstract-worker              |   threads: 2
unstract-worker              |   worker_connections: 1000
unstract-worker              |   max_requests: 0
unstract-worker              |   max_requests_jitter: 0
unstract-worker              |   timeout: 900
unstract-worker              |   graceful_timeout: 30
unstract-worker              |   keepalive: 2
unstract-worker              |   limit_request_line: 4094
unstract-worker              |   limit_request_fields: 100
unstract-worker              |   limit_request_field_size: 8190
unstract-worker              |   reload: False
unstract-worker              |   reload_engine: auto
unstract-worker              |   reload_extra_files: []
unstract-worker              |   spew: False
unstract-worker              |   check_config: False
unstract-worker              |   print_config: False
unstract-worker              |   preload_app: False
unstract-worker              |   sendfile: None
unstract-worker              |   reuse_port: False
unstract-worker              |   chdir: /app
unstract-worker              |   daemon: False
unstract-worker              |   raw_env: []
unstract-worker              |   pidfile: None
unstract-worker              |   worker_tmp_dir: None
unstract-worker              |   user: 0
unstract-worker              |   group: 0
unstract-worker              |   umask: 0
unstract-worker              |   initgroups: False
unstract-worker              |   tmp_upload_dir: None
unstract-worker              |   secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
unstract-worker              |   forwarded_allow_ips: ['127.0.0.1', '::1']
unstract-worker              |   accesslog: -
unstract-worker              |   disable_redirect_access_to_syslog: False
unstract-worker              |   access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
unstract-worker              |   errorlog: -
unstract-worker              |   loglevel: debug
unstract-worker              |   capture_output: False
unstract-worker              |   logger_class: gunicorn.glogging.Logger
unstract-worker              |   logconfig: None
unstract-worker              |   logconfig_dict: {}
unstract-worker              |   logconfig_json: None
unstract-worker              |   syslog_addr: udp://localhost:514
unstract-worker              |   syslog: False
unstract-worker              |   syslog_prefix: None
unstract-worker              |   syslog_facility: user
unstract-worker              |   enable_stdio_inheritance: False
unstract-worker              |   statsd_host: None
unstract-worker              |   dogstatsd_tags:
unstract-worker              |   statsd_prefix:
unstract-worker              |   proc_name: None
unstract-worker              |   default_proc_name: unstract.worker.main:app
unstract-worker              |   pythonpath: None
unstract-worker              |   paste: None
unstract-worker              |   on_starting: <function OnStarting.on_starting at 0x2aaaac8d14c0>
unstract-worker              |   on_reload: <function OnReload.on_reload at 0x2aaaac8d15e0>
unstract-worker              |   when_ready: <function WhenReady.when_ready at 0x2aaaac8d1700>
unstract-worker              |   pre_fork: <function Prefork.pre_fork at 0x2aaaac8d1820>
unstract-worker              |   post_fork: <function Postfork.post_fork at 0x2aaaac8d1940>
unstract-worker              |   post_worker_init: <function PostWorkerInit.post_worker_init at 0x2aaaac8d1a60>
unstract-worker              |   worker_int: <function WorkerInt.worker_int at 0x2aaaac8d1b80>
unstract-worker              |   worker_abort: <function WorkerAbort.worker_abort at 0x2aaaac8d1ca0>
unstract-worker              |   pre_exec: <function PreExec.pre_exec at 0x2aaaac8d1dc0>
unstract-worker              |   pre_request: <function PreRequest.pre_request at 0x2aaaac8d1ee0>
unstract-worker              |   post_request: <function PostRequest.post_request at 0x2aaaac8d1f70>
unstract-worker              |   child_exit: <function ChildExit.child_exit at 0x2aaaac8f60d0>
unstract-worker              |   worker_exit: <function WorkerExit.worker_exit at 0x2aaaac8f61f0>
unstract-worker              |   nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x2aaaac8f6310>
unstract-worker              |   on_exit: <function OnExit.on_exit at 0x2aaaac8f6430>
unstract-worker              |   ssl_context: <function NewSSLContext.ssl_context at 0x2aaaac8f6550>
unstract-worker              |   proxy_protocol: False
unstract-worker              |   proxy_allow_ips: ['127.0.0.1', '::1']
unstract-worker              |   keyfile: None
unstract-worker              |   certfile: None
unstract-worker              |   ssl_version: 2
unstract-worker              |   cert_reqs: 0
unstract-worker              |   ca_certs: None
unstract-worker              |   suppress_ragged_eofs: True
unstract-worker              |   do_handshake_on_connect: False
unstract-worker              |   ciphers: None
unstract-worker              |   raw_paste_global_conf: []
unstract-worker              |   permit_obsolete_folding: False
unstract-worker              |   strip_header_spaces: False
unstract-worker              |   permit_unconventional_http_method: False
unstract-worker              |   permit_unconventional_http_version: False
unstract-worker              |   casefold_http_method: False
unstract-worker              |   forwarder_headers: ['SCRIPT_NAME', 'PATH_INFO']
unstract-worker              |   header_map: drop
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Starting gunicorn 23.0.0
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [DEBUG] Arbiter booted
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Listening at: http://0.0.0.0:5002 (9)
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Using worker: gevent
unstract-worker              | [2024-08-19 16:21:15 +0000] [12] [INFO] Booting worker with pid: 12
unstract-worker              | [2024-08-19 16:21:15 +0000] [14] [INFO] Booting worker with pid: 14
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [DEBUG] 2 workers
unstract-worker              | [2024-08-19 16:42:09 +0000] [14] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:42:09,625] INFO in docker: Image 'unstract/tool-structure:0.0.37' not found in the local system.
unstract-worker              | [2024-08-19 16:42:09,626] INFO in docker: Pulling the container: unstract/tool-structure:0.0.37
unstract-worker              | [2024-08-19 16:42:16,740] INFO in docker: CONTAINER PULL STATUS: Downloading - 9317ce34db73 : [===================================>               ]  3.146MB/4.415MB
unstract-worker              | [2024-08-19 16:42:21,440] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [==>                                                ]  20.97MB/383.1MB
unstract-worker              | [2024-08-19 16:42:26,540] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [===>                                               ]  29.36MB/383.1MB
unstract-worker              | [2024-08-19 16:42:31,543] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [=======>                                           ]  55.57MB/383.1MB
unstract-worker              | [2024-08-19 16:42:36,640] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [=========>                                         ]   73.4MB/383.1MB
unstract-worker              | [2024-08-19 16:42:41,645] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [================================>                  ]  174.1MB/267.4MB
unstract-worker              | [2024-08-19 16:42:46,743] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=====================================>             ]  198.2MB/267.4MB
unstract-worker              | [2024-08-19 16:42:51,749] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=========================================>         ]  220.2MB/267.4MB
unstract-worker              | [2024-08-19 16:42:56,844] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=============================================>     ]  244.3MB/267.4MB
unstract-worker              | [2024-08-19 16:43:01,853] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [========================>                          ]  185.6MB/383.1MB
unstract-worker              | [2024-08-19 16:43:11,745] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [====================================>              ]  277.9MB/383.1MB
unstract-worker              | [2024-08-19 16:43:21,844] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [================================================>  ]  372.2MB/383.1MB
unstract-worker              | [2024-08-19 16:43:31,414] INFO in docker: Finished pulling the container: unstract/tool-structure:0.0.37
unstract-worker              | [2024-08-19 16:43:31,417] INFO in docker: Docker config: {'name': 'unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/3e48657a-22c3-4fe5-81dc-e42789e3c465', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:43:32,118] INFO in worker: Running Docker container: unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3
unstract-worker              | [2024-08-19 16:43:59,247] ERROR in worker: Error while running docker container unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:43:59 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:43:59 +0000] [14] [DEBUG] Closing connection.
unstract-worker              | [2024-08-19 16:44:52 +0000] [12] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:44:53,006] INFO in docker: Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | [2024-08-19 16:44:53,007] INFO in docker: Docker config: {'name': 'unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/914b1f13-ce9b-4716-8996-d8dd154023c6', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:44:53,261] INFO in worker: Running Docker container: unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13
unstract-worker              | [2024-08-19 16:45:33 +0000] [14] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:45:33,992] INFO in docker: Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | INFO:unstract.worker.main:Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | [2024-08-19 16:45:33,993] INFO in docker: Docker config: {'name': 'unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/814d91a5-e474-436e-88ca-94305bcc9e4d', 'target': '/data'}], 'labels': []}
unstract-worker              | INFO:unstract.worker.main:Docker config: {'name': 'unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/814d91a5-e474-436e-88ca-94305bcc9e4d', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:45:34,230] INFO in worker: Running Docker container: unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26
unstract-worker              | INFO:unstract.worker.main:Running Docker container: unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26
unstract-worker              | [2024-08-19 16:45:44,519] ERROR in worker: Error while running docker container unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:45:44 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:45:44 +0000] [12] [DEBUG] Ignoring EPIPE
unstract-worker              | [2024-08-19 16:46:04,721] ERROR in worker: Error while running docker container unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:46:05 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:46:05 +0000] [14] [DEBUG] Closing connection.

kun432 · 2024-08-19T17:50:55Z

Also, some things I found:

With Llama Parse as the text extractor, the "Getting Started" instructions with "credit card statements" didn't work even when I used the demo files. In that case, parsing failed frequently.
After changing the text extractor from Llama Parse to LLMWhisperer, the "Getting Started" instructions with "credit card statements" worked perfectly.
With LLMWhisperer as the text extractor and my Japanese invoice files, it seems to work fine, although the worker was sometimes killed.

I hope this helps.

Deepak-Kesavan · 2024-08-19T18:32:01Z

> Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'

Regarding the above error: we store the file under the name INFILE, without an extension, but something is looking for a file named INFILE with a .pdf extension. This looks like the issue you are hitting when running the workflow or calling the API. I will see if I can replicate it on my machine and either provide a proper solution or raise a PR if this needs a fix.
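The mismatch described here can be reproduced in isolation. The following is a minimal sketch, not Unstract code: it uses a temporary directory as a stand-in for the tool's /data volume, and the helper name `try_open` is hypothetical. It only shows that a file stored as INFILE cannot be opened as INFILE.pdf, which produces the same `[Errno 2]` seen in the worker log.

```python
import tempfile
from pathlib import Path


def try_open(path: Path) -> str:
    """Return 'ok' if the file can be opened, otherwise the OS error text."""
    try:
        with open(str(path), "rb"):
            return "ok"
    except FileNotFoundError as exc:
        # Errno 2 (ENOENT) is the "[Errno 2]" reported by the worker.
        return str(exc)


with tempfile.TemporaryDirectory() as data_dir:
    # Stand-in for /data: the input is stored without an extension.
    stored = Path(data_dir) / "INFILE"
    stored.write_bytes(b"%PDF-1.4 placeholder")

    result_plain = try_open(stored)                        # name as stored
    result_with_ext = try_open(stored.with_suffix(".pdf"))  # name with .pdf appended

print(result_plain)
print(result_with_ext)
```

If a component appends an extension to the stored name before opening it, the lookup fails even though the file itself is present.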

Deepak-Kesavan · 2024-08-20T06:58:39Z

@kun432 I am unable to replicate the issue. I even tried renaming the file to Japanese text, and it ran successfully. Have you tried using a different file other than the one you are currently using?

Additionally, by setting REMOVE_CONTAINER_ON_EXIT=False in the worker's .env file, you can prevent the tool container from being removed, which might provide additional logs.
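Once the tool container is kept around, its name is needed to inspect it. As a rough sketch, the snippet below pulls container names out of worker log text and builds the standard `docker logs <name>` command for each. The regular expression is an assumption based on the "Error while running docker container ..." lines quoted in this thread, and the helper names are hypothetical.

```python
import re
from typing import List

# Assumed format, taken from the worker log lines pasted in this issue.
CONTAINER_RE = re.compile(r"Error while running docker container (\S+?):")


def failed_tool_containers(log_text: str) -> List[str]:
    """Return unique container names, in order of first appearance."""
    seen = {}
    for match in CONTAINER_RE.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


def docker_logs_command(container: str) -> List[str]:
    """Build (but do not run) the command that prints a container's logs."""
    return ["docker", "logs", container]


sample_log = (
    "[2024-08-19 16:46:04,721] ERROR in worker: Error while running docker container "
    "unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data "
    "and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'\n"
    "ERROR:unstract.worker.main:Error while running docker container "
    "unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data "
    "and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'\n"
)

names = failed_tool_containers(sample_log)
commands = [docker_logs_command(name) for name in names]
print(commands)
```

The command lists can be passed to `subprocess.run` on a machine where Docker is available.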

ritwik-g · 2024-08-20T07:22:22Z

@kun432 In addition to what @Deepak-Kesavan suggested, can you also check whether there are any files within the folder below?

/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/3e48657a-22c3-4fe5-81dc-e42789e3c465

Please share the ls output on this folder.
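For anyone who prefers a script over `ls`, here is a small sketch that walks a directory and reports every file with its size, which makes it easy to see whether the input was stored as INFILE or under some other name. The function name `list_files` is hypothetical, and the demo uses a temporary directory with placeholder file names rather than the real execution folder.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_files(root: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under root, sorted."""
    base = Path(root)
    return sorted(
        (p.relative_to(base).as_posix(), p.stat().st_size)
        for p in base.rglob("*")
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as demo_dir:
    # Placeholder contents; the real folder layout is not known from this thread.
    (Path(demo_dir) / "INFILE").write_bytes(b"12345")
    (Path(demo_dir) / "example.json").write_text("{}")
    listing = list_files(demo_dir)

print(listing)
```

To check a real run, call `list_files` with the execution folder path mentioned above.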

kun432 · 2024-08-21T15:54:41Z

@Deepak-Kesavan

> I even tried renaming the file to Japanese text, and it ran successfully.

Does this mean you used Llama Parse as the text extractor?

Because, as I said before,

> Using LLMWhisperer as text extractor with my Japanese invoice files, seems working fine although sometime worker was killed.

so I don't think this problem comes from my Japanese invoice PDF files.

Deepak-Kesavan · 2024-08-21T20:13:40Z

> @Deepak-Kesavan
>
> > I even tried renaming the file to Japanese text, and it ran successfully.
>
> This means you used Llama Parse as text extractor?
>
> Because, as I said before,
>
> > Using LLMWhisperer as text extractor with my Japanese invoice files, seems working fine although sometime worker was killed.
>
> so, I don't think this problems come from my Japanese invoice PDF files.

@kun432 I initially thought the issue might be due to the name of the PDF, but it seems that was not the case.

Could you please provide the information mentioned in the comments above by @ritwik-g and me so we can debug this further?

ritwik-g · 2024-08-22T05:22:08Z

> also some things I found:
>
> Using Llama Parse as text extractor, "Getting Started" instruction with "credit card statements" didn't work even I used demo files. In that case, parsing failed a lot.
>
> Changed text extractor from Llama Parse to LLMWhisperer, "Getting Started" instruction with "credit card statements" works perfect.
>
> Using LLMWhisperer as text extractor with my Japanese invoice files, seems working fine although sometime worker was killed.
>
> Hope these are of some help.

@kun432 I missed this message earlier. If your use case works fine with LLMWhisperer, then the issue might be that Llama Parse fails to parse Japanese text. Can you confirm whether the issue happens mainly with Llama Parse? If that's the case, you might need to try using Llama Parse directly once to see whether the extraction works.

kun432 · 2024-08-22T14:21:14Z

@Deepak-Kesavan @ritwik-g

After removing all cloned repos, containers, images, and volumes and re-cloning, I set everything up from scratch and tested with Llama Parse as the text extractor. The same error happened again (it currently seems 100% reproducible).

I summarized my whole setup procedures and logs below:
https://gist.github.com/kun432/a8d7238c9c1fd738aed5f7d7771ba4a5

kun432 · 2024-08-23T08:27:20Z

I added the result of using LLMWhisperer to the Gist above (see the last comment).

> I missed this message earlier. If your use case is working fine with LLMWhisperer I think then the issue might be that the Llama Parser fails to parse japanese text? So can you confirm if the issue is happening mainly with Llama Parse? If that's the case you might need to try using Llama Parse directly once to see if the extraction is working or not.

Llama Parse CAN handle Japanese text.

result

With Llama Parse, as shown in the screenshot attached to my first comment, extraction seems to work in Prompt Studio (this means the document was parsed using Llama Parse, right?).

So I guess something is wrong in workflow execution, and it shows up only when using Llama Parse.

ritwik-g · 2024-08-23T09:06:54Z

@kun432 Yes, this might be a Llama Parse-specific problem. Thanks for the detailed reproduction steps. Let us take a look into this.

ritwik-g · 2024-08-23T09:20:23Z

@kun432 looks like this is an issue already reported by our QA. This is a high priority bug but we are working on some other critical items. Will be picking this up as soon as possible.

For the time being, if you are able to use LLMWhisperer, please use it instead.

kun432 added the bug Something isn't working label Aug 19, 2024

Deepak-Kesavan self-assigned this Aug 19, 2024

ritwik-g assigned harini-venkataraman and unassigned Deepak-Kesavan Aug 23, 2024
