The content discusses the high cost of CI (Continuous Integration) failures in a monorepo pipeline running on self-hosted EC2 instances: of more than 1,300 CI runs, only about 26% succeeded. Each run fans out into several parallel jobs, such as Docker image builds and integration tests, which collectively consume significant compute even when individual job times look short. Although the average wall-clock time per run is 43 minutes, the compute consumed across all parallel jobs is roughly 10 hours 54 minutes per run, so failed and cancelled runs waste substantial resources — almost half of the total compute time is unproductive.
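The arithmetic above can be sketched as a quick back-of-the-envelope estimate. The run count, success rate, and per-run compute come from the figures cited; the fraction of a full run's compute that a failed or cancelled run consumes is an assumption chosen to reconcile the 26% success rate with the "almost half wasted" figure (failed runs often stop early, so they cost less than a full run):

```python
# Rough estimate of wasted CI compute, using the figures cited above.
# ASSUMPTION (not from the source): a failed/cancelled run consumes on
# average ~35% of a full run's compute, since failures often stop early.

TOTAL_RUNS = 1_300
SUCCESS_RATE = 0.26
COMPUTE_PER_RUN_MIN = 10 * 60 + 54       # 10 h 54 min across all parallel jobs
FAILED_FRACTION_OF_FULL = 0.35           # illustrative assumption

successful = int(TOTAL_RUNS * SUCCESS_RATE)
failed = TOTAL_RUNS - successful

productive_min = successful * COMPUTE_PER_RUN_MIN
wasted_min = failed * COMPUTE_PER_RUN_MIN * FAILED_FRACTION_OF_FULL
total_min = productive_min + wasted_min

print(f"productive: {productive_min / 60:,.0f} h")
print(f"wasted:     {wasted_min / 60:,.0f} h ({wasted_min / total_min:.0%} of total)")
```

Under that assumption the wasted share comes out to about 50%, consistent with the "almost half" observation; a higher average failed-run duration would push the waste well past half.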
- AWS EC2 2xlarge instances
- Docker (all versions)
- CI/CD tools like Jenkins or GitHub Actions
- Analyze the workflow's job graph and trim unnecessary concurrency; parallel jobs multiply compute per run, so each parallel job should pay for itself in wall-clock savings.
- Upgrade Docker where applicable; newer releases include CI-relevant optimizations such as BuildKit's improved build caching.
- Pin specific versions in pipeline configuration files (e.g., a Jenkinsfile or GitHub Actions YAML) so runs do not land on outdated software that may contribute to failures.
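The first recommendation — analyzing the workflow before cutting concurrency — can be sketched as a small script that ranks jobs by compute spent on runs that ultimately failed or were cancelled. The record layout below is hypothetical (most CI systems can export something similar via their API), and the durations are illustrative:

```python
# Hypothetical sketch: rank CI jobs by minutes of compute spent on runs
# that did not succeed. The record layout and numbers are assumptions,
# not a real CI API schema.
from collections import defaultdict

runs = [
    {"conclusion": "failure",   "jobs": {"docker-build": 22, "integration": 41, "lint": 4}},
    {"conclusion": "success",   "jobs": {"docker-build": 21, "integration": 43, "lint": 4}},
    {"conclusion": "cancelled", "jobs": {"docker-build": 22, "integration": 12, "lint": 4}},
]

wasted = defaultdict(int)  # job name -> minutes spent on non-successful runs
for run in runs:
    if run["conclusion"] != "success":
        for job, minutes in run["jobs"].items():
            wasted[job] += minutes

# Jobs at the top of this ranking are the best candidates for fail-fast
# ordering, better caching, or cancelling superseded runs.
for job, minutes in sorted(wasted.items(), key=lambda kv: -kv[1]):
    print(f"{job}: {minutes} wasted min")
```

Feeding this a few weeks of real run data makes it obvious which parallel jobs are burning compute on doomed runs and should run later, faster, or not at all.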
This issue can significantly impact homelab stacks using similar CI/CD tools and Docker, especially those that rely heavily on parallel job execution. Tuning the configuration to the specific tasks each CI job performs keeps that waste down.