
Overview

This page collects the most common issues encountered when developing, deploying, and self-hosting Tesslate Studio. Each section includes symptoms, diagnosis commands, root causes, and solutions. If you are new to the codebase, scan the section headers to find the category that matches your problem.

Container Issues

Devserver Image Missing

Symptoms: User project containers fail to start; pods are stuck in ImagePullBackOff or ErrImagePull.

Diagnosis:
# Kubernetes: check which image is being requested
kubectl describe pod -n proj-<uuid> | grep Image

# Check backend environment variable
kubectl exec -n tesslate deployment/tesslate-backend -- env | grep K8S_DEVSERVER

# Docker: check if the image exists locally
docker images | grep tesslate-devserver
Root cause: The tesslate-devserver image was never built or loaded into the cluster.

Solution:
docker build -t tesslate-devserver:latest -f orchestrator/Dockerfile.devserver orchestrator/
If you are running on Minikube, also load the freshly built image into the cluster (see ImagePullBackOff below).

ImagePullBackOff

Symptoms: Pod stuck in ImagePullBackOff state.

Diagnosis:
kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" in Events
Common causes and solutions:
  1. Image not loaded into cluster (Minikube): Run minikube -p tesslate image load <image>:latest
  2. ECR credentials expired (AWS): Re-authenticate: aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
  3. Wrong image name in config: Verify K8S_DEVSERVER_IMAGE in the backend environment matches the actual image name

Pod Stuck in CrashLoopBackOff

Symptoms: Pod repeatedly crashes and restarts.

Diagnosis:
# Check current pod status
kubectl get pods -n <namespace>

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check container logs (current and previous)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
Common causes:
  1. Missing environment variables: Verify secrets are properly mounted: kubectl exec -n tesslate deployment/tesslate-backend -- env | grep DATABASE
  2. Database connection failure: Check DATABASE_URL and ensure the database pod is running
  3. Missing Python dependencies: Rebuild the image with --no-cache

Namespace Stuck in Terminating

Symptoms: A project namespace stays in Terminating state and never completes deletion.

Diagnosis:
kubectl get ns | grep proj-
kubectl get all -n proj-<uuid>
Solution: Force-delete the namespace by removing its finalizers:
kubectl get ns proj-<uuid> -o json | \
  jq '.spec.finalizers = []' | \
  kubectl replace --raw "/api/v1/namespaces/proj-<uuid>/finalize" -f -
Warning: Force-deleting a namespace skips finalizer cleanup. Ensure no critical resources (such as PVCs holding important data) are left orphaned.

Database Issues

Connection Refused

Symptoms: Backend logs show Connection refused or timeout errors for PostgreSQL.

Diagnosis:
# Docker: check postgres container
docker compose ps postgres

# Kubernetes: check postgres pod
kubectl get pods -n tesslate | grep postgres
kubectl logs -n tesslate deployment/tesslate-postgres
Common causes:
  1. Database not running: Restart it: docker compose up -d postgres or kubectl rollout restart deployment/tesslate-postgres -n tesslate
  2. Wrong DATABASE_URL: Verify the format: postgresql+asyncpg://user:pass@host:5432/dbname. Check with: kubectl exec -n tesslate deployment/tesslate-backend -- env | grep DATABASE_URL
  3. Network policy blocking: Ensure the NetworkPolicy allows backend-to-database traffic
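Since a malformed DATABASE_URL is the most common of these causes, its format can be sanity-checked locally before digging further. The helper below is a sketch, not part of the codebase; it only checks the postgresql+asyncpg URL shape shown above:

```shell
# Hypothetical helper (not part of the codebase): checks that a DATABASE_URL
# uses the postgresql+asyncpg scheme and includes user, host, port, and dbname.
check_database_url() {
  case "$1" in
    postgresql+asyncpg://*:*@*:*/?*) echo "format OK" ;;
    postgresql://*)                  echo "missing the +asyncpg driver suffix" ;;
    *)                               echo "unrecognized format" ;;
  esac
}

check_database_url "postgresql+asyncpg://user:pass@host:5432/dbname"
```

The "missing the +asyncpg driver suffix" branch catches the typical mistake: a plain postgresql:// URL works for psql but not for the async SQLAlchemy engine.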

Migration Errors

Symptoms: alembic upgrade head fails.

Diagnosis:
cd orchestrator
alembic current   # Show current revision
alembic history   # Show migration history
Common issues and solutions:
alembic: ERROR: Multiple heads detected
Two developers created migrations from the same revision. Merge them:
alembic merge heads -m "merge_heads"
alembic upgrade head
relation "tablename" does not exist
Run all pending migrations:
alembic upgrade head
If a migration fails midway, check the current state and fix manually:
alembic current
# If the migration was already applied manually, stamp it:
alembic stamp <revision_id>
If autogenerate produces empty or incomplete migrations, ensure all model files are imported in alembic/env.py so their tables are registered with the metadata:
from app.database import Base
from app import models
from app import models_kanban
from app import models_auth

Database Seeding Failures

Symptoms: Seed scripts fail or produce no data.

Diagnosis:
# Check if migrations have been applied
docker exec tesslate-orchestrator alembic current

# Run seed script with verbose output
docker exec -e PYTHONPATH=/app tesslate-orchestrator python /tmp/seed_marketplace_bases.py
Common causes:
  1. Migrations not applied: Run alembic upgrade head first
  2. Script not copied into container: Verify the docker cp step completed successfully
  3. PYTHONPATH not set: Always include -e PYTHONPATH=/app when running scripts inside the container

Agent Issues

LLM Timeout or No Response

Symptoms: Chat messages do not get responses. The UI spins indefinitely.

Diagnosis:
# Check backend logs for chat/agent errors
kubectl logs -n tesslate deployment/tesslate-backend | grep -i "chat\|agent\|litellm"

# Check LiteLLM configuration
kubectl exec -n tesslate deployment/tesslate-backend -- env | grep LITELLM
Common causes:
  1. Missing API key: Verify LITELLM_API_BASE and LITELLM_MASTER_KEY are set
  2. Rate limiting: Check logs for rate limit errors; implement exponential backoff
  3. Model not available: Verify the model name in LITELLM_DEFAULT_MODELS is correct and accessible
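For the rate-limiting case, a minimal retry wrapper with exponential backoff might look like the following. The function name and delay schedule are illustrative, not part of Tesslate Studio:

```shell
# Hypothetical retry helper (illustrative only): retries a command up to
# 5 times, doubling the sleep between failed attempts (1s, 2s, 4s, 8s).
retry_with_backoff() {
  attempt=1
  delay=1
  while [ "$attempt" -le 5 ]; do
    "$@" && return 0        # success: stop retrying
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  return 1                  # all attempts failed
}

# Example: retry_with_backoff curl -sf "$LITELLM_API_BASE/health"
```

For persistent rate limiting, prefer fixing the limit on the provider or proxy side; client-side retries only smooth over transient spikes.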

Tool Execution Failures

Symptoms: Agent tool calls fail. Logs show tool execution errors.

Diagnosis:
kubectl logs -n tesslate deployment/tesslate-backend | grep -i "tool\|execute"
Common causes:
  1. Container not running: The user project container must be started before the agent can execute file or shell operations
  2. File path issues: Tool file paths are relative to the project root; verify the expected file exists
  3. Permission denied: Check that the container user has write access to the target directory

Streaming Errors

Symptoms: Agent responses are cut off mid-stream, or the SSE connection drops.

Common causes:
  1. Proxy timeout: NGINX Ingress default timeouts may be too short for long agent runs. Ingress annotations should set proxy-read-timeout and proxy-send-timeout to 3600
  2. Client-side EventSource disconnect: Ensure the frontend properly handles reconnection
  3. Backend exception during streaming: Check backend logs for tracebacks during the stream
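The timeout fix from cause 1 is applied through ingress annotations. The fragment below is a sketch assuming the standard ingress-nginx annotation keys; the resource name is a placeholder:

```yaml
# Sketch only: ingress-nginx annotations for long-lived SSE streams.
# The metadata name is a placeholder; the timeout values come from above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tesslate-backend            # placeholder name
  namespace: tesslate
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```

The values are seconds; without them, ingress-nginx closes idle proxied connections after its default 60 seconds, which truncates long agent runs.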

Deployment Issues (External Providers)

SSL Certificate Not Valid

Symptoms: Browser shows a certificate warning when accessing the application.

Diagnosis:
# Check certificate status
kubectl get certificate -n tesslate
kubectl describe certificate tesslate-wildcard-tls -n tesslate

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
Common causes:
  1. DNS not propagated: Wait up to 48 hours for DNS propagation
  2. Cloudflare API token invalid: The token needs Zone:Zone:Read and Zone:DNS:Edit permissions
  3. Wildcard cert subdomain limitation: a wildcard covers only one subdomain level, so *.domain.com does not match foo.bar.domain.com; deeper subdomains require a separate certificate or Cloudflare proxying

Domain Routing (503 Service Unavailable)

Symptoms: Browser shows a 503 error when accessing the application or a user project.

Diagnosis:
# Check pod readiness
kubectl get pods -n tesslate

# Check service endpoints
kubectl get endpoints -n tesslate

# Check ingress controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=50
Solutions:
  1. Pod not ready: Wait for the pod to pass readiness checks, or check why it is failing
  2. Service endpoint stale: Restart the ingress controller: kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx
  3. Ingress misconfigured: Inspect with kubectl describe ingress -n tesslate

CORS Errors

Symptoms: Browser console shows Access to fetch has been blocked by CORS policy.

Solutions:
  1. Verify APP_DOMAIN in backend config matches your frontend origin
  2. Check the DynamicCORSMiddleware in main.py includes the correct URL patterns
  3. Ensure both HTTP and HTTPS origins are allowed if your setup uses both

Docker Issues

Image Not Updating After Rebuild

Symptoms: Code changes do not appear after rebuilding and redeploying.

Root cause: Docker (and Minikube) caches images and does not overwrite existing images with the same tag.

Solution (Minikube):
# 1. Delete old image from Minikube
minikube -p tesslate ssh -- docker rmi -f tesslate-backend:latest

# 2. Rebuild with --no-cache
docker rmi -f tesslate-backend:latest
docker build --no-cache -t tesslate-backend:latest -f orchestrator/Dockerfile orchestrator/

# 3. Load to Minikube
minikube -p tesslate image load tesslate-backend:latest

# 4. Delete pod to force restart
kubectl delete pod -n tesslate -l app=tesslate-backend

Volume Permission Errors

Symptoms: Container fails to read or write files. Logs show "Permission denied."

Common causes:
  1. Wrong user inside container: Ensure the container user (1000:1000) owns the project files
  2. Host filesystem permissions: On Linux, Docker volumes may inherit restrictive host permissions
  3. Windows line endings: Files created on Windows may cause script execution failures inside Linux containers
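For the Windows line-endings case, the carriage returns can be stripped in place with sed. This is a minimal sketch with a hypothetical file name:

```shell
# Hypothetical example: a script authored on Windows carries CRLF endings,
# which break the shebang line when executed inside a Linux container.
printf '#!/bin/sh\r\necho hello\r\n' > start.sh

# Strip the trailing carriage return from every line, in place (GNU sed).
sed -i 's/\r$//' start.sh
```

Where available, dos2unix does the same job; a .gitattributes rule such as `* text eol=lf` prevents the problem from recurring.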

Network Conflicts

Symptoms: Containers cannot communicate, or ports conflict on the host.

Diagnosis:
docker network ls | grep tesslate
docker port <container-name>
Solutions:
  1. Stop conflicting services on the host that use the same ports (5432, 8000, 5173)
  2. Ensure the project network is connected to Traefik: check the Compose file for network configuration

Kubernetes Issues

Minikube Image Caching

Problem: minikube image load does not overwrite existing images with the same tag.

Solution: Always delete the old image before loading the new one:
minikube -p tesslate ssh -- docker rmi -f <image>:latest
minikube -p tesslate image load <image>:latest

NGINX Ingress Configuration

Problem: Ingress returns 503 or routes to the wrong backend.

Diagnosis:
kubectl get ingress -n tesslate -o yaml
kubectl describe ingress <name> -n tesslate
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=50
Common fixes:
  1. Restart the ingress controller after backend deployments: kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx
  2. Verify service selectors match pod labels
  3. Check that the ingress class annotation matches your controller

PVC Not Bound

Symptoms: Pod stuck in Pending with event "unbound PersistentVolumeClaims."

Diagnosis:
kubectl get pvc -n proj-<uuid>
kubectl describe pvc project-storage -n proj-<uuid>
Common causes:
  1. StorageClass not found: Verify K8S_STORAGE_CLASS matches an available StorageClass: kubectl get sc
  2. No available PersistentVolumes: The dynamic provisioner may not be configured
  3. Pod affinity violation: All pods sharing a ReadWriteOnce (RWO) PVC must run on the same node. Check for affinity constraint failures

VolumeSnapshot Hibernation Failures

Symptoms: Hibernation fails with "snapshot not ready" or "snapshot creation failed."

Diagnosis:
# Check backend logs
kubectl logs -n tesslate deployment/tesslate-backend | grep -i snapshot

# Check VolumeSnapshot status
kubectl get volumesnapshot -n proj-<uuid>
kubectl describe volumesnapshot <name> -n proj-<uuid>

# Check snapshot controller
kubectl logs -n kube-system -l app=snapshot-controller
Common causes:
  1. VolumeSnapshotClass not configured: Ensure tesslate-ebs-snapshots exists: kubectl get volumesnapshotclass
  2. EBS CSI driver not installed: The snapshot feature requires the AWS EBS CSI driver with snapshot support
  3. PVC does not exist or is not bound: Verify the PVC is in Bound state before attempting a snapshot
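To isolate whether a failure lies in the CSI layer or in the backend's hibernation logic, a snapshot can be created manually. The manifest below is a sketch using the class and PVC names mentioned above; the metadata name is a placeholder:

```yaml
# Sketch only: manual VolumeSnapshot against the project PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: project-storage-snapshot    # placeholder name
  namespace: proj-<uuid>
spec:
  volumeSnapshotClassName: tesslate-ebs-snapshots
  source:
    persistentVolumeClaimName: project-storage
```

Apply it and watch kubectl get volumesnapshot -n proj-<uuid> -w until READYTOUSE becomes true; if it never does, the problem is in the snapshot controller or CSI driver, not in the backend.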

Quick Diagnostic Commands

# Overall cluster health
kubectl get pods --all-namespaces
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20

# Tesslate-specific
kubectl get pods -n tesslate -o wide
kubectl logs -n tesslate deployment/tesslate-backend --tail=100
kubectl logs -n tesslate deployment/tesslate-frontend --tail=100

# User project namespaces
kubectl get pods --all-namespaces | grep proj-
kubectl get ingress --all-namespaces | grep proj-

# Resource usage
kubectl top pods -n tesslate
kubectl top nodes

# Network
kubectl get svc -n tesslate
kubectl get endpoints -n tesslate

Getting Help

If you cannot resolve an issue:
1. Collect diagnostic information:

kubectl get pods -n tesslate -o yaml > pods.yaml
kubectl logs -n tesslate deployment/tesslate-backend --tail=500 > backend.log
kubectl describe pods -n tesslate > describe.txt

2. Search the codebase: search the source for the error message. Many errors have comments explaining the root cause and fix.

3. Create a detailed issue report: include steps to reproduce, expected vs. actual behavior, relevant logs, configuration, and environment details (Minikube/AWS, versions).