heating the vessel ......... OK
bgp sessions .............. ESTABLISHED
crystallizing infra ....... 99.1% PURE
respect the chemistry ..... ALWAYS
mahith.vengama ............ LIVE
SENIOR DEVOPS ENGINEER — AWS · AZURE · TERRAFORM · AI OPS · LET'S COOK

Mahith Vengama

I run samsung.com's cloud — 135 microservices, two clouds, one private backbone. I am the one who deploys. And the infrastructure below? It's running. 99.1% pure.

01

The lab

Not screenshots. Every panel below is my actual work, cooking in front of you next to the config that runs it in production.

Canary deployment

Native ECS canary — a small batch ships first, purity-checked, then the full cook goes out.

canary: 0% traffic · ALB · weighted · v1: stable batch · v2: small batch
deployment_configuration {
  strategy             = "CANARY"
  bake_time_in_minutes = 5 # circuit-breaker rollback armed
}
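A minimal sketch of the decision the panel animates, assuming a simple model: a small slice goes to v2, it bakes through a fixed number of health checks, and it is either promoted to 100% or rolled back to 0%. The names `run_canary` and `is_healthy` and the 10% slice are hypothetical, chosen for illustration; this is not the ECS implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CanaryResult:
    promoted: bool
    history: List[int]  # percent of traffic on v2 after each step


def run_canary(
    is_healthy: Callable[[int], bool],
    canary_percent: int = 10,  # hypothetical small-batch size
    bake_checks: int = 5,      # one check per bake minute, mirroring bake_time_in_minutes = 5
) -> CanaryResult:
    """Shift a small slice to v2, bake, then promote to 100% or roll back to 0%."""
    history = [canary_percent]
    for _ in range(bake_checks):
        if not is_healthy(canary_percent):
            history.append(0)  # rollback: all traffic returns to v1
            return CanaryResult(promoted=False, history=history)
    history.append(100)  # bake passed: the full batch goes out
    return CanaryResult(promoted=True, history=history)
```

Passing the health check in as a function keeps the sketch testable: a stub that always fails exercises the rollback path without any real traffic.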

BGP failover

The primary line ruptures. Flow reroutes to the reserve. Nobody notices.

primary: flowing · AWS · Azure · Direct Connect: primary line · ExpressRoute: reserve line
bgp {
  session   = "ebgp-multihop"
  hold_time = 30 # failover < 1s observed
}
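The reroute can be pictured as a best-path choice: among sessions that are still established, the highest local preference wins. The sketch below assumes that simplified model. The `Path` type and the preference values 200 and 100 are hypothetical, and real BGP best-path selection evaluates more attributes than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    name: str
    local_pref: int    # higher wins, as in BGP best-path selection
    established: bool  # whether the BGP session is up


def best_path(paths: List[Path]) -> Optional[Path]:
    """Pick the established path with the highest local preference; None if all are down."""
    live = [p for p in paths if p.established]
    return max(live, key=lambda p: p.local_pref) if live else None


paths = [Path("primary", 200, True), Path("reserve", 100, True)]
paths[0].established = False  # the primary line ruptures
print(best_path(paths).name)
```

With the primary marked down, the only established path left is the reserve, so traffic moves without any change to the preference values.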

Peak-event autoscaling

K6-modeled heat. Vessels rack in as the reaction climbs — Black Friday, handled.

reaction temp · vessels 4
scale-out threshold
target_tracking {
  metric = "ECSServiceAverageCPUUtilization"
  target = 60
  min    = 4
  max    = 32
}
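A back-of-the-envelope sketch of what a 60% CPU target with a floor of 4 and a ceiling of 32 implies for task count. It assumes a plain proportional rule, which is an approximation for illustration and not the exact algorithm Application Auto Scaling runs; `desired_tasks` is a hypothetical helper.

```python
import math


def desired_tasks(current: int, cpu_percent: float,
                  target: float = 60.0, min_tasks: int = 4, max_tasks: int = 32) -> int:
    """Proportional estimate: scale so average CPU lands near the target, then clamp."""
    raw = math.ceil(current * cpu_percent / target)
    return max(min_tasks, min(max_tasks, raw))
```

For example, 4 tasks running at 90% average CPU suggests 6 tasks to get back to 60%, while any estimate below 4 or above 32 is clamped to the configured bounds.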

AI incident triage

My Bedrock agent runs the purity check. Every claim cites an observation.

pure p95 slo
ALERT purity drop — p95 latency > 800ms
OBS-1 Azure PG failover event 14:02Z
OBS-2 conn pool saturation follows OBS-1
VERDICT root cause: db failover — cites OBS-1,2
verdict = agent.investigate(alert) # read-only · grounded · no citation, no claim
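The "no citation, no claim" rule can be sketched as a small gate on the agent's output. This assumes a verdict carries a list of observation ids; the `Verdict` type and `is_grounded` function are hypothetical names for illustration, not the agent's real interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Verdict:
    claim: str
    cites: List[str]  # observation ids, e.g. ["OBS-1", "OBS-2"]


def is_grounded(verdict: Verdict, observations: Dict[str, str]) -> bool:
    """Accept a verdict only if it cites at least one observation and every cited id exists."""
    return bool(verdict.cites) and all(c in observations for c in verdict.cites)


observations = {
    "OBS-1": "Azure PG failover event 14:02Z",
    "OBS-2": "conn pool saturation follows OBS-1",
}
verdict = Verdict("root cause: db failover", ["OBS-1", "OBS-2"])
print(is_grounded(verdict, observations))
```

A verdict with an empty citation list, or one that cites an id that was never observed, is rejected before it reaches anyone.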

Terraform, first

Console is for experiments. Everything that persists is code.

$ terraform apply # let's cook
Plan: 64 to add, 0 to change, 0 to destroy
module.ecs_service: creating...
Apply complete! 64 resources crystallized.
backend "remote" {
  organization = "…" # TFC · OIDC · zero static creds
}

140+ alerts. All owned.

A rack of vials. They flare, get claimed, get neutralized. Noise isn't allowed in this lab.

unowned: 0
alert { owner = required } # or it gets deleted
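A minimal sketch of how an "owner required" rule could be enforced as a lint step, assuming alert rules are available as dictionaries with `name` and `owner` keys. That shape and the `unowned_alerts` helper are hypothetical, not the real rule format.

```python
from typing import Dict, List


def unowned_alerts(rules: List[Dict[str, str]]) -> List[str]:
    """Return the names of alert rules whose owner is missing or blank."""
    return [r["name"] for r in rules if not r.get("owner", "").strip()]


rules = [
    {"name": "p95-latency", "owner": "platform-team"},
    {"name": "disk-full", "owner": "  "},
    {"name": "orphaned-rule"},
]
print(unowned_alerts(rules))
```

Failing a CI run whenever this list is non-empty is one way to keep the unowned count at zero.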

02

The product

/01

AI Ops Agent

Bedrock · Grounded AI

A read-only, hypothesis-driven incident triage agent on AWS Bedrock. Wired to Teams, Grafana, and live AWS resources across a multi-account org; validated fixture-first with an eval harness and negative-control gates before touching production.

/02

Trivue

Cloudflare Workers · Product

An AI-native prediction platform built end-to-end on Cloudflare Workers. Serverless at the edge, zero infrastructure to babysit. This site lives on the same edge → usetrivue.com

/03

Native ECS Canary

Terraform · Delivery

Replaced CodeDeploy blue/green with ECS's native canary strategy in pure Terraform — progressive traffic shifting, circuit-breaker rollback, deployment-failure alerting, expand-contract DB migrations.

/04

Account Factory

AFT · TFC · OIDC

Multi-account vending with AFT, Terraform Cloud with OIDC (zero static credentials), and a layered module library. Went from zero IaC to Terraform-first as a hard rule.

/05

Observability

Grafana · OpenSearch · K6

Monitoring from scratch — Grafana, Prometheus, Sentry, ELK→OpenSearch across log and data clusters. 140+ alert rules with named owners, SLAs on findings, K6-backed capacity plans.

03

Lab notes

Native ECS canary deployments with Terraform · Medium
Direct Connect ↔ ExpressRoute over Equinix Fabric · Medium
A grounded AI ops agent that never touches prod · Medium
All articles → @mkreddy9989

04

The empire business

Jan 2022 — Present

Samsung Electronics America · DevOps Engineer

Own cloud infra, security, and ops for samsung.com — 135 microservices on ECS. Terraform-first with TFC, OIDC, and AFT; native canary delivery, hybrid AWS–Azure networking over Equinix, the observability platform, an AI ops agent on Bedrock. Three annual peak events, zero downtime.

Aug 2021 — Jan 2022

Alten USA · Azure Cloud Engineer

ExpressRoute with Private and Microsoft Peering, BGP routing and failover. Terraform modules for AKS, networking, security; ExpressRoute over Equinix Fabric.

Oct 2020 — Jul 2021

DevRabbit IT Solutions · Cloud & DevOps Engineer

CI/CD with CodePipeline and CodeBuild, CloudFront traffic optimization, Ansible, Lambda-driven automation.

Feb 2019 — May 2020

Texas A&M University · Graduate Research Assistant

Event registration system processing 5,000+ registrations; M.S. Management Information Systems, 2020.

May 2016 — Dec 2018

Kirby Building Systems · Build & Release Engineer

Docker on Kubernetes with Helm, Jenkins push-button releases QA→prod, secrets with Vault.

M.S. MIS, Texas A&M
B.Tech, Anna University
AWS Solutions Architect, Associate
Azure Fundamentals

05

Tread lightly

A live cluster. Kill services. Nuke the database. It self-heals every time — I respect the chemistry.

uptime 100.00%
traffic 4,183 req/s
your kills 0
mttr
status CHEMISTRY RESPECTED
⚡ FAILOVER — replica promoted in 830ms
Best effort so far: 0 kills · still 99.1% pure. Annoying, right?
You just tried to take my production down. It didn't even blink.

Say my name.

[email protected]
You're goddamn right.
LinkedIn Medium GitHub Resume