Troubleshooting Guide: Workflow Orchestration Failures in R2R

Workflow orchestration is a critical component of R2R, managed by Hatchet. When orchestration failures occur, they can disrupt the entire data processing pipeline. This guide will help you identify and resolve common issues.

1. Check Hatchet Service Status

First, ensure that the Hatchet service is running properly:

docker ps | grep hatchet

Look for containers with names like hatchet-engine, hatchet-api, and hatchet-rabbitmq.

2. Examine Hatchet Logs

View the logs for the Hatchet engine:

docker logs r2r-hatchet-engine-1

Look for error messages or warnings that might indicate the source of the problem.

3. Common Issues and Solutions

3.1 Connection Issues with RabbitMQ

Symptom: Hatchet logs show connection errors to RabbitMQ.

Solution:

  1. Check if RabbitMQ is running:
    docker ps | grep rabbitmq
    
  2. Verify RabbitMQ credentials in the Hatchet configuration.
  3. Ensure the RabbitMQ container is in the same Docker network as Hatchet.

3.2 Database Connection Problems

Symptom: Errors in Hatchet logs related to database connections.

Solution:

  1. Verify Postgres container is running and healthy:
    docker ps | grep postgres
    
  2. Check database connection settings in Hatchet configuration.
  3. Ensure the Postgres container is in the same Docker network as Hatchet.

3.3 Workflow Definition Errors

Symptom: Specific workflows fail to start or execute properly.

Solution:

  1. Review the workflow definition in your R2R configuration.
  2. Check for syntax errors or invalid step definitions.
  3. Verify that all required environment variables for the workflow are set.

3.4 Resource Constraints

Symptom: Workflows start but fail due to timeout or resource exhaustion.

Solution:

  1. Check system resources (CPU, memory) on the host machine.
  2. Adjust resource limits for Docker containers if necessary.
  3. Consider scaling up your infrastructure or optimizing workflow resource usage.

3.5 Version Incompatibility

Symptom: Unexpected errors after updating R2R or Hatchet.

Solution:

  1. Ensure all components (R2R, Hatchet, RabbitMQ) are compatible versions.
  2. Check the R2R documentation for any breaking changes in recent versions.
  3. Consider rolling back to a known working version if issues persist.

4. Advanced Debugging

4.1 Inspect Hatchet API

Use the Hatchet API to get more details about workflow executions:

  1. Find the Hatchet API container:
    docker ps | grep hatchet-api
    
  2. Use curl to query the API (replace <container-id> with the actual ID):
    docker exec <container-id> curl http://localhost:7077/api/v1/workflows
    

4.2 Check Hatchet Dashboard

If you have the Hatchet dashboard set up:

  1. Access the dashboard (typically at http://localhost:8002 if using default ports).
  2. Navigate to the Workflows section to view detailed execution status and logs.

4.3 Analyze RabbitMQ Queues

Inspect RabbitMQ queues to check for message backlogs or routing issues:

  1. Access the RabbitMQ management interface (typically at http://localhost:15672).
  2. Check queue lengths, message rates, and any dead-letter queues.

5. Common Workflow-Specific Issues

5.1 Document Ingestion Failures

Symptom: Documents fail to process through the ingestion workflow.

Solution:

  1. Check the Unstructured API configuration and connectivity.
  2. Verify file permissions and formats of ingested documents.
  3. Examine R2R logs for specific ingestion errors.

5.2 Vector Store Update Issues

Symptom: Vector store (Postgres with pgvector) not updating correctly.

Solution:

  1. Check Postgres logs for any errors related to vector operations.
  2. Verify the pgvector extension is properly installed and enabled.
  3. Ensure the R2R configuration correctly specifies the vector store settings.

5.3 Knowledge Graph Generation Failures

Symptom: Neo4j graph not updating or showing incorrect data.

Solution:

  1. Check Neo4j connection settings in the R2R configuration.
  2. Verify Neo4j logs for any query execution errors.
  3. Ensure the workflow steps for graph generation are correctly defined.

6. Seeking Further Help

If you’re still experiencing issues after trying these solutions:

  1. Gather all relevant logs (R2R, Hatchet, RabbitMQ, Postgres, Neo4j).
  2. Document the steps to reproduce the issue.
  3. Check the R2R GitHub repository for similar reported issues.
  4. Consider opening a new issue on the R2R GitHub repository with your findings.

Remember to provide:

  • R2R version (r2r version)
  • Docker Compose configuration
  • Relevant parts of your R2R configuration
  • Detailed error messages and logs

By following this guide, you should be able to diagnose and resolve most workflow orchestration issues in R2R. If problems persist, don’t hesitate to seek help from the R2R community or support channels.