
If you have not yet run OpenSail on a local Kubernetes cluster, start with the Local Kubernetes guide first. The cloud topology is the same; AWS just adds managed nodes, ECR, and real TLS.
1. What you will deploy
EKS cluster
Kubernetes 1.35 on managed node groups (on-demand plus spot), OIDC provider for IRSA, control-plane logs to CloudWatch.
ECR
Repositories in account 859561299901 (us-east-1) for tesslate-backend, tesslate-frontend, tesslate-devserver, tesslate-ast, tesslate-btrfs-csi, and seeded app images.
S3 + IAM (IRSA)
Encrypted bucket for project hibernation and CAS. IRSA roles for backend, EBS CSI, cluster-autoscaler, external-dns, cert-manager.
NGINX Ingress + NLB
AWS Network Load Balancer fronts NGINX Ingress. All HTTP traffic terminates here before hitting workloads.
cert-manager + Cloudflare
Let’s Encrypt wildcard via DNS01 against Cloudflare. Terraform creates the apex and *.domain records on the NLB hostname.
btrfs CSI + Volume Hub
Per-project btrfs subvolumes with CAS sync to S3 and Volume Hub orchestration. Same stack as local minikube.
The per-environment stack (k8s/terraform/aws/) provisions a VPC (10.0.0.0/16, three public plus three private subnets across three AZs), a NAT gateway, LiteLLM (with optional RDS backend), and the application workloads in the tesslate namespace (backend, frontend, worker, Redis, Postgres or external RDS, cleanup CronJobs).
The shared stack (k8s/terraform/shared/) provisions cross-environment resources: ECR repos, a small tesslate-platform-eks cluster for internal tools (Headscale VPN and friends), and platform-level NGINX Ingress plus cert-manager plus Cloudflare DNS.
2. Prerequisites
AWS account + CLI
AWS CLI v2.
aws sts get-caller-identity must succeed against the target account.
Terraform >= 1.5
Provider pinning handled in
main.tf. State is stored in S3, keyed per environment.
kubectl + Helm
kubectl matching EKS 1.35. Helm v3 for running cert-manager or ingress charts manually.
Docker with buildx
Needed for
linux/amd64 builds. Apple Silicon still works because buildx cross-compiles.
Cloudflare API token
Token scoped to
Zone:DNS:Edit and Zone:Zone:Read on the target zone.
ECR push access
IAM user in
eks_admin_iam_arns, or a named team role with push permissions.
- Ability to create VPC, EKS, IAM, S3, and ECR resources (direct or via assumed role).
- An IAM user listed in eks_admin_iam_arns for the target environment, or membership in a team IAM group. See the EKS access model section for details.
- AWS Secrets Manager access to tesslate/terraform/{production,beta,shared} for pulling tfvars.
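A quick sanity check of the toolchain above (all standard CLI invocations; expected versions follow this guide's requirements):

```shell
# Credentials must resolve to the target account
aws sts get-caller-identity

# Toolchain versions: Terraform >= 1.5, kubectl matching EKS 1.35, Helm v3
terraform version
kubectl version --client
helm version --short

# buildx must be available for linux/amd64 cross-builds
docker buildx version
```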
Windows / MSYS: prefix kubectl and docker exec with MSYS_NO_PATHCONV=1 so Git Bash does not rewrite paths.
3. First-time provisioning
Apply the shared stack
Apply beta first
The aws-deploy.sh helper auto-detects backend drift: if your local .terraform/terraform.tfstate points at the wrong environment, it reinitializes with the correct backend HCL before running plan or apply.
4. Environments: beta vs production
The two environments differ in the following fields:

| Field | Beta | Production |
|---|---|---|
| Terraform state key | beta/terraform.tfstate | production/terraform.tfstate |
| Backend config | backend-beta.hcl | backend-production.hcl |
| tfvars file | terraform.beta.tfvars | terraform.production.tfvars |
| Secrets Manager entry | tesslate/terraform/beta | tesslate/terraform/production |
| Kustomize overlay | k8s/overlays/aws-beta/ | k8s/overlays/aws-production/ |
| ECR tag | :beta | :production |
| kubectl context | tesslate-beta-eks | tesslate-production-eks |
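Assuming the per-environment stack lives in k8s/terraform/aws/, the table fields map onto Terraform invocations roughly like this (a sketch — aws-deploy.sh wraps and drift-checks these steps for you):

```shell
cd k8s/terraform/aws

# Point local state at the beta backend; -reconfigure clears backend drift
terraform init -backend-config=backend-beta.hcl -reconfigure

# Plan against the beta variable file
terraform plan -var-file=terraform.beta.tfvars
```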
5. Secrets management (envFrom auto-sync)
Three Kubernetes secrets in the tesslate namespace are fully terraform-managed from k8s/terraform/aws/kubernetes.tf:
- tesslate-app-secrets: app-level config (APP_DOMAIN, LITELLM_MASTER_KEY, OAuth client secrets, Stripe keys, SMTP, PostHog, etc.)
- postgres-secret: Postgres credentials
- s3-credentials: S3 bucket config (the backend pod uses IRSA for auth, so no static AWS keys land in the secret)
The backend Deployment consumes all three via envFrom. This is the auto-sync half of the pattern: every key added to a terraform-managed secret is available as a pod env var on the next rollout, with no kustomize edit required.
The other half is explicit env entries in k8s/overlays/aws-base/backend-patch.yaml. Those entries live under a $patch: replace directive so the base manifest’s env array is wiped and only static values plus one alias (K8S_INGRESS_DOMAIN to APP_DOMAIN) remain. Without $patch: replace, stale base entries would merge back in.
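A hypothetical sketch of the shape of that patch (container name, envFrom wiring, and secret-key names are assumed for illustration, not copied from the repo):

```yaml
# Illustrative shape of k8s/overlays/aws-base/backend-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tesslate-backend          # Deployment name assumed
spec:
  template:
    spec:
      containers:
        - name: backend           # container name assumed
          envFrom:                # auto-sync half: every secret key becomes an env var
            - secretRef:
                name: tesslate-app-secrets
            - secretRef:
                name: postgres-secret
            - secretRef:
                name: s3-credentials
          env:
            - $patch: replace     # wipe the base env array; only entries below survive
            - name: K8S_INGRESS_DOMAIN   # explicit alias to APP_DOMAIN
              valueFrom:
                secretKeyRef:
                  name: tesslate-app-secrets
                  key: APP_DOMAIN
```

In a strategic merge patch, the `$patch: replace` directive is itself a list element; everything after it replaces, rather than merges with, the base array.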
Rotating a secret
The reload step rolls pods so they pick up the new secret values.
6. EKS access
EKS uses a role-based model. Regular humans assume one of four team roles. Terraform, CI, and a small list of named admins (tesslate-terraform, tesslate-bigboss) assume the eks-deployer role.
| I want to… | Role | ARN pattern |
|---|---|---|
| kubectl logs, get, describe, read CloudWatch logs, browse ECR | team-observer | arn:aws:iam::859561299901:role/tesslate-{env}-eks-team-observer |
| Above, plus kubectl rollout, restart/patch deployments, push to ECR | team-deployer | arn:aws:iam::859561299901:role/tesslate-{env}-eks-team-deployer |
| Above, plus kubectl exec, shell into pods, run debug containers | team-debugger | arn:aws:iam::859561299901:role/tesslate-{env}-eks-team-debugger |
| Above, plus Secrets Manager, RBAC, namespace mgmt, IAM team users | team-admin | arn:aws:iam::859561299901:role/tesslate-{env}-eks-team-admin |
Configure kubectl with the deployer role
Run this once per machine.
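The one-time setup presumably looks like the following (cluster name and alias inferred from the tesslate-{env}-eks context pattern used throughout this guide; production shown):

```shell
aws eks update-kubeconfig \
  --region us-east-1 \
  --name tesslate-production-eks \
  --alias tesslate-production-eks \
  --role-arn arn:aws:iam::859561299901:role/tesslate-production-eks-eks-deployer
```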
--role-arn bakes role assumption into the resulting kubeconfig, so every later kubectl call uses it.
Team members who are not in eks_admin_iam_arns should use named AWS CLI profiles with role_arn entries for the team role they need. The full onboarding flow (IAM groups, ~/.aws/config snippets, assume-role one-liner) lives in the EKS Cluster Access guide.
aws-deploy.sh invokes aws eks update-kubeconfig under the hood with --role-arn arn:aws:iam::859561299901:role/tesslate-{env}-eks-eks-deployer every time it touches the cluster, so you do not have to rerun the commands above for its subcommands.
7. Build and push images
Six images live in ECR under account 859561299901 in us-east-1:
| Repository | Dockerfile | Purpose |
|---|---|---|
| tesslate-backend | orchestrator/Dockerfile | FastAPI + ARQ worker |
| tesslate-frontend | app/Dockerfile.prod | React + Vite SPA behind NGINX |
| tesslate-devserver | orchestrator/Dockerfile.devserver | User project container base |
| tesslate-ast | services/ast/Dockerfile | AST parser sidecar of the backend pod |
| tesslate-btrfs-csi | services/btrfs-csi/Dockerfile | CSI driver + Volume Hub |
| tesslate-markitdown, tesslate-deerflow | seeds/apps/.../Dockerfile | Seeded Tesslate Apps |
Tags are :production or :beta for first-class images; :latest for seeded app images.
The build subcommand performs these steps:
- git submodule update --init --recursive (the agent runner in packages/tesslate-agent is copied into the backend image).
- aws ecr get-login-password | docker login against 859561299901.dkr.ecr.us-east-1.amazonaws.com.
- docker buildx build --platform linux/amd64 --push in parallel across selected images.
- aws eks update-kubeconfig with the eks-deployer role for the target environment.
- kubectl apply -k k8s/overlays/aws-{env} to pick up any manifest changes.
- Rolling restart of the impacted Deployments plus a parallel kubectl rollout status --timeout=300s.
- If the backend was rebuilt, python -m scripts.seed_apps runs inside the backend pod to upsert the Tesslate Apps registry.
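Done by hand for a single image, the build-and-push portion of those steps looks roughly like this (backend image, production tag; build context and Dockerfile path taken from the table above):

```shell
REGISTRY=859561299901.dkr.ecr.us-east-1.amazonaws.com

# Agent runner submodule must exist before the backend build
git submodule update --init --recursive

# Authenticate Docker against ECR
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "$REGISTRY"

# Cross-compile for the cluster architecture and push in one step
docker buildx build \
  --platform linux/amd64 \
  -f orchestrator/Dockerfile \
  -t "$REGISTRY/tesslate-backend:production" \
  --push .
```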
8. Deploy
9. Verify
Log in through the browser at https://<your-domain>/ and confirm the dashboard loads and you can create a project.
10. DNS and TLS
DNS and certificates are fully managed by terraform plus in-cluster controllers:
- Cloudflare DNS records (dns.tf) create CNAMEs for the apex domain and *.domain pointing at the NLB hostname, proxied through Cloudflare.
- external-dns reconciles per-project subdomain records from Ingress annotations when users deploy preview projects.
- cert-manager runs a ClusterIssuer that uses the Cloudflare API token for DNS01 challenges, minting a wildcard Let’s Encrypt cert stored in the tesslate-wildcard-tls Secret. Ingress resources reference it via K8S_WILDCARD_TLS_SECRET.
- Cloudflare SSL mode should be Full (strict) so browser-to-edge and edge-to-NLB are both encrypted.
If the certificate stays Ready=False for longer than ten minutes, check kubectl describe certificaterequest and the cert-manager logs for Cloudflare API errors.
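A debugging sequence for a stuck certificate might look like the following (the cert-manager namespace and deployment name are the chart defaults, assumed here):

```shell
CTX=tesslate-production-eks

# Inspect the Certificate and its CertificateRequests
kubectl --context="$CTX" -n tesslate get certificate
kubectl --context="$CTX" -n tesslate describe certificaterequest

# Controller logs usually name the failing Cloudflare API call
kubectl --context="$CTX" -n cert-manager logs deploy/cert-manager --tail=100
```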
11. Seed the production database
Run this once after the initial terraform apply, after alembic upgrade head has been applied inside the backend pod. Seeds upsert by slug, so running twice is safe but wasteful.
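Assuming the seed is run manually via kubectl exec rather than an aws-deploy.sh subcommand, the one-time sequence would be roughly:

```shell
CTX=tesslate-production-eks

# Apply migrations, then upsert the Tesslate Apps registry
kubectl --context="$CTX" -n tesslate exec deploy/tesslate-backend -- alembic upgrade head
kubectl --context="$CTX" -n tesslate exec deploy/tesslate-backend -- python -m scripts.seed_apps
```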
12. Scaling
Three independent layers:
Pod replicas
k8s/overlays/aws-production/replicas-patch.yaml sets backend, frontend, worker, and ingress controller replica counts. Hotfix scale: kubectl --context=tesslate-production-eks scale deploy/tesslate-backend -n tesslate --replicas=4.
HPA
metrics-server is installed by default (enable_metrics_server = true). Add a HorizontalPodAutoscaler per Deployment as needed.
Cluster autoscaler
Installed via IRSA. The on-demand node group scales between eks_node_min_size and eks_node_max_size; spot scales up to eks_spot_max_size. User project workloads prefer the spot group. To add more node groups, define additional_node_groups in tfvars and apply; the schema lives in variables.tf.
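A minimal HorizontalPodAutoscaler sketch for the backend (replica bounds and the CPU target are illustrative choices, not values from the repo):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tesslate-backend
  namespace: tesslate
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tesslate-backend
  minReplicas: 2            # illustrative bounds
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

metrics-server must be running (it is, by default) for the Utilization metric source to resolve.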
The worker Deployment defaults to a single replica because in-memory task state is still partially colocated with the process. Scale it cautiously and watch Redis queue depth via the worker logs.
13. Observability
- Control plane logs stream to CloudWatch at /aws/eks/tesslate-{env}-eks/cluster.
- Workload logs: kubectl --context=tesslate-production-eks logs -n tesslate deploy/tesslate-backend -f.
- Metrics: kubectl --context=tesslate-production-eks top pods -n tesslate and kubectl top nodes. For historical data, install kube-prometheus-stack via Helm or route metrics from the OpenTelemetry Collector to CloudWatch.
- Structured logs + OpenTelemetry: see the Enterprise observability guide for deploying the OTel Collector, wiring exporters, and enabling the audit log stream.
14. Updates and migrations
Trigger deploy
Run the Deploy Production workflow (.github/workflows/deploy-production.yml) in GitHub Actions. It downloads tesslate/terraform/production from Secrets Manager, runs terraform plan -detailed-exitcode, applies, then runs ./scripts/aws-deploy.sh deploy-k8s production followed by ./scripts/aws-deploy.sh build production. Or run the same two aws-deploy.sh commands from your workstation. Either path updates pods via rollout restart, so the Deployment is always serving traffic. For a hard cutover or a very large schema migration, drain traffic first using the safe shutdown procedure.
15. Rollback and safe shutdown
Rollback a single Deployment: push the previous known-good image to :production, or bump newTag in k8s/overlays/aws-production/kustomization.yaml to a specific SHA and reapply.
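For an immediate revert without rebuilding, kubectl rollout undo is a reasonable sketch (deployment and context names from this guide):

```shell
CTX=tesslate-production-eks

kubectl --context="$CTX" -n tesslate rollout undo deploy/tesslate-backend
kubectl --context="$CTX" -n tesslate rollout status deploy/tesslate-backend --timeout=300s
```

Note that with a mutable :production tag, undo only restores the previous pod spec; pinning newTag to a specific SHA is the reliable rollback path.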
For planned downtime, maintenance windows, or draining user pods cleanly, follow the safe shutdown procedure in docs/guides/safe-shutdown-procedure.md on GitHub: stop user containers, scale the worker and backend to zero, pause the task queue, and only then apply the risky change.
16. Troubleshooting
AccessDenied: eks:DescribeCluster
Your IAM user is not in a team IAM group or eks_admin_iam_arns. Assume a team role first (one-shot aws sts assume-role, or a named AWS profile with role_arn) before running any aws eks or kubectl command.
error: You must be logged in to the server (Unauthorized)
error: You must be logged in to the server (Unauthorized)
ErrImagePull / unauthorized on a tesslate-* image
no match for platform in manifest
Image was built on arm64. Rebuild with docker buildx build --platform linux/amd64 --push. The build subcommand does this by default.
Ingress returns 503
Backend pod not Ready, or the ingress-nginx controller cache is stale after a backend restart.
Certificate stuck Ready=False
Cloudflare API token missing DNS edit permission, or the zone is wrong. Verify the token has Zone:Zone:Read and Zone:DNS:Edit on the correct zone, and check the cert-manager logs.
Frontend calls go to /api/api/...
The api-url ConfigMap includes a trailing /api. Set frontend_api_url = "https://opensail.tesslate.com" in tfvars (no /api) and reapply.
Backend CrashLoopBackOff immediately
Secret rotation broke a required key, or tesslate-app-secrets is missing a key consumed via envFrom. Confirm the key exists in k8s/terraform/aws/kubernetes.tf and reapply.
No module named 'tesslate_agent' in the backend
Git submodule was not initialized before docker build. Run git submodule update --init --recursive, then rebuild. The build script handles this automatically.
Volume Hub pods stuck Terminating
The CSI DaemonSet rolled at the same time as the Volume Hub. Wait for the tesslate-btrfs-csi-node rollout to stabilize; ./scripts/aws-deploy.sh build compute and reload volume-hub sequence these correctly.
Orphaned proj-* namespaces
A project was deleted before its namespace drained.
terraform apply times out on Helm resources
cert-manager or external-dns is pending due to DNS. Apply with -target=module.eks first, then rerun the full apply once the cluster is Ready.
Next steps
Local Kubernetes
Reproduce production locally on minikube for debugging storage, ingress, or snapshot issues.
Docker Setup
Fastest inner-loop dev flow. Useful when you do not need Kubernetes in the loop.
Publishing Apps
Ship a Tesslate App to the marketplace from production.
Billing
Configure Stripe, tiers, and credit accounting for the hosted experience.
Getting help
Discord
Real-time help from the Tesslate community.
GitHub
Source, issues, and release notes.
Direct support at
[email protected].