Last update: June 2026. All opinions are my own.
Why bother
A model in a notebook is a sketch. A model behind an HTTPS endpoint is a system someone can actually use.
Most of the data work I'd done before lived in the first category — train, evaluate, write up, close the laptop. Useful for learning, useless to anyone but me. This project was the opposite exercise: a small cloud migration. Take a model I'd already built (forest cover type prediction, the Kaggle classifier), pull every thread until it lifts off my laptop and runs as a real service on Microsoft Azure.
The model itself is the same one from the earlier post — a Random Forest on the cartographic features of the Roosevelt National Forest dataset, predicting one of seven tree cover types. The interesting part this time isn't the model. It's the system around it.
You can hit it right now:
curl https://forest-cover-endpoint.whiteflower-743727a8.northeurope.azurecontainerapps.io/health
# {"status":"ok"}
It might take 5–10 seconds the first time: the container scales to zero when nobody's using it, so a cold start spins one up. Subsequent calls are instant.
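If you'd rather call the endpoint from code, a client has to tolerate that cold start. Below is a minimal sketch, not the project's code: the function names, attempt count, and delay are my own assumptions, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.request
from typing import Callable

HEALTH_URL = (
    "https://forest-cover-endpoint.whiteflower-743727a8"
    ".northeurope.azurecontainerapps.io/health"
)


def fetch_status(url: str, timeout: float = 15.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], int] = fetch_status,
    attempts: int = 5,          # assumed value, tune to taste
    delay_seconds: float = 2.0,  # assumed value, tune to taste
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it answers 200, leaving room for a cold start."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers URLError, HTTPError and socket timeouts: the container
            # may still be starting, so fall through and retry.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Usage: wait_until_healthy(HEALTH_URL) returns True once /health answers 200.
```

The generous per-request timeout matters more than the retry count here: the first request is the one that wakes the container.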
The stack, and why each piece
I wanted the cheapest, simplest path that ticked the boxes any sensible engineering team would expect to see for a deployed ML service.
- FastAPI + Pydantic for the inference service. Type-checked request/response models, automatic OpenAPI docs, fast enough that the model is the bottleneck rather than the framework.
- scikit-learn + joblib for the actual model. The same RandomForestClassifier from the notebook, serialised once at training time and loaded once at container start.
- Docker to freeze the runtime. Python version, libraries, model weights, all in one immutable artifact.
- GitHub Actions for CI (lint + tests on every push) and a separate Deploy workflow that trains a fresh model, builds the image, pushes it to GHCR, and rolls it onto Azure on every merge to main.
- Azure Container Apps for the host. The killer feature: --min-replicas 0 means the app scales down to nothing when idle. A portfolio endpoint that's hit twice a week costs me roughly nothing.
- OIDC federated credentials instead of stored secrets. GitHub Actions exchanges a workflow token for a short-lived Azure access token. There's no long-lived secret to rotate, leak, or forget about.
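To make "type-checked request models" concrete, here is a rough standard-library stand-in for the request schema. The real service uses Pydantic, which I'm not reproducing here. The field names come from the sample request later in this post; the class name, the range checks, and the spellings of the three wilderness areas other than "Rawah" are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed labels: only "Rawah" appears in this post. The other three are the
# remaining Covertype wilderness areas, spelled as a guess at what the API accepts.
WILDERNESS_AREAS = ("Rawah", "Neota", "Comanche Peak", "Cache la Poudre")


@dataclass(frozen=True)
class CoverTypeRequest:
    """Stdlib stand-in for the service's Pydantic request model."""

    elevation: int
    aspect: int
    slope: int
    horizontal_distance_to_hydrology: int
    vertical_distance_to_hydrology: int
    horizontal_distance_to_roadways: int
    hillshade_9am: int
    hillshade_noon: int
    hillshade_3pm: int
    horizontal_distance_to_fire_points: int
    wilderness_area: str
    soil_type: int

    def __post_init__(self) -> None:
        # Range checks are assumptions based on the dataset's encoding.
        if not 0 <= self.aspect <= 360:
            raise ValueError("aspect must be an azimuth in degrees (0-360)")
        for name in ("hillshade_9am", "hillshade_noon", "hillshade_3pm"):
            if not 0 <= getattr(self, name) <= 255:
                raise ValueError(f"{name} must be in 0-255")
        if self.wilderness_area not in WILDERNESS_AREAS:
            raise ValueError(f"unknown wilderness_area: {self.wilderness_area!r}")
        if not 1 <= self.soil_type <= 40:
            raise ValueError("soil_type must be in 1-40")


def parse_request(payload: dict) -> CoverTypeRequest:
    """Build a validated request; missing or extra keys raise TypeError."""
    return CoverTypeRequest(**payload)
```

A dataclass only checks field names and the ranges written above. Pydantic additionally coerces and validates types and feeds the generated OpenAPI docs, which is why it's the better fit for the actual service.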
The architecture in one paragraph
A push to main triggers the Deploy workflow. The workflow installs the training deps, retrains the model end-to-end on a 50k row sample, runs the FastAPI tests, builds a Docker image containing the freshly trained model, pushes it to ghcr.io, logs into Azure via OIDC (no secrets stored), and tells the Container App to roll to the new image tag. The Container App's load balancer drains the old container, starts the new one, runs the health check, and the rollout is done. The whole thing takes about 6 minutes from git push to a new model serving real traffic.
🔑 The only things stored in GitHub secrets are identifiers: client ID, tenant ID, subscription ID, resource group name, container app name. Nothing that can authenticate on its own. The actual auth is a short-lived federated token, minted per workflow run.
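What makes the federated token safe is that Azure only accepts it when its claims match the federated credential configured on the app registration. The sketch below illustrates that matching idea and nothing more: it decodes a token's claims without verifying the signature (Azure does the real validation), and it assumes a branch-scoped credential. The helper names are hypothetical, and this repo's actual credential subject isn't shown in this post.

```python
import base64
import json

# Issuer of GitHub Actions OIDC tokens.
GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"


def expected_subject(owner: str, repo: str, branch: str) -> str:
    """Subject claim GitHub issues for a workflow run on a branch."""
    return f"repo:{owner}/{repo}:ref:refs/heads/{branch}"


def decode_claims(jwt: str) -> dict:
    """Decode a JWT payload WITHOUT verifying it (illustration only)."""
    payload_b64 = jwt.split(".")[1]
    # base64url segments in a JWT drop their padding; restore it first.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def matches_federated_credential(
    claims: dict, owner: str, repo: str, branch: str
) -> bool:
    """True if issuer and subject match a branch-scoped credential."""
    return (
        claims.get("iss") == GITHUB_OIDC_ISSUER
        and claims.get("sub") == expected_subject(owner, repo, branch)
    )
```

A token minted on a fork or on another branch carries a different `sub`, so the exchange fails even if someone gets hold of the client ID and tenant ID.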
The four things that broke along the way
Most of the work was the usual provisioning. The interesting bit is what wasn't in the readme.
1. West Europe was closed for new customers
The plan was West Europe — closest, lowest latency, what every Azure tutorial defaults to. The Container Apps environment provisioning failed with:
Resource 'workspace-...' was disallowed by Azure: The selected region
is currently not accepting new customers.
The Azure free tier is capacity-rationed per region. Whole regions go "closed to new" when they fill up. The fix was a 30-second az group delete + recreate in North Europe (Ireland). Latency from Madrid is barely worse. Lesson: pick your region last, not first.
2. Dockerfile copy order vs pip install .
The package's pyproject.toml lists both app/ and train/ as setuptools packages. The original Dockerfile copied just pyproject.toml and ran pip install . before copying the source. Egg-info couldn't find the package directories and the build crashed on:
error: package directory 'app' does not exist
This is the classic Dockerfile efficiency-vs-correctness tension. Best practice is to install deps before copying source, so a code change doesn't bust the dep-install layer cache. But if your install resolves the project itself (not just its deps), the source has to be present. The right fix here was to accept the cache miss: copy everything, then install.
3. .dockerignore was hiding train/ from the build context
After the copy-order fix, COPY train ./train failed with "/train": not found. The .dockerignore excluded train/, which is sensible when the training code is only needed to produce model artifacts and less sensible when pip install . resolves it as a runtime package. I removed the entry. The runtime image is ~50 KB bigger.
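Both of these failures come down to the same thing: a package directory that setuptools expects was missing from the build context when pip install . ran. A cheap guard is a preflight check before the install step. This is a hypothetical helper, not something in the repo, and it takes the package names as an argument rather than parsing pyproject.toml.

```python
from pathlib import Path
from typing import Iterable, List


def missing_package_dirs(context_root: str, packages: Iterable[str]) -> List[str]:
    """Return the package names that have no directory under `context_root`."""
    root = Path(context_root)
    return [name for name in packages if not (root / name).is_dir()]


def check_build_context(context_root: str, packages: Iterable[str]) -> None:
    """Fail early, with a clearer message than the setuptools error."""
    missing = missing_package_dirs(context_root, packages)
    if missing:
        raise FileNotFoundError(
            f"package directories missing from build context: {missing}. "
            "Check the Dockerfile COPY order and .dockerignore."
        )


# Usage, assuming the two packages named in this post:
# check_build_context(".", ["app", "train"])
```

Run inside the image just before the install, this would have pointed at the COPY order and the .dockerignore entry directly instead of surfacing as two separate errors.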
4. azure/CLI@v2 isolated the OIDC session
This one took the longest to find. The deploy workflow uses azure/login@v2 for OIDC, then originally used azure/CLI@v2 to run az containerapp update. The login step logged "Subscription is set successfully. Azure CLI login succeeds by using OIDC." — and then the next step failed with:
ERROR: The containerapp '***' does not exist
The container app did exist. The federated service principal did have Contributor on the resource group. The login step did succeed. So what was going on?
azure/CLI@v2 spawns its own Docker container to run the Azure CLI. The auth state from azure/login@v2 lives on the runner's host filesystem and doesn't reliably reach the inner container. The CLI inside the container had no active session, and asking it "find this container app" returned an empty result that the error message labels as "does not exist."
The fix was to drop azure/CLI@v2 and run az directly with a plain run: step. The host runner already has the CLI and the active login from the previous step. Smaller surface, less indirection. I also added an az account show line right before the update, so every future debug session gets the subscription identity printed clearly in the log.
What the deploy workflow looks like
The deploy job, condensed:
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write # OIDC
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install -e '.[train]'
      - run: python -m train.train --sample-size 50000 --n-estimators 200
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          # Build from the workspace, not the default Git context,
          # so the freshly trained model is part of the image.
          context: .
          push: true
          tags: |
            ghcr.io/maria-aguilera/forest-cover-endpoint:${{ github.sha }}
            ghcr.io/maria-aguilera/forest-cover-endpoint:latest
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: Update Container App image
        env:
          APP_NAME: ${{ secrets.AZURE_CONTAINERAPP_NAME }}
          RG: ${{ secrets.AZURE_RESOURCE_GROUP }}
          IMAGE_REF: ghcr.io/maria-aguilera/forest-cover-endpoint:${{ github.sha }}
        run: |
          az extension add --name containerapp --upgrade --only-show-errors
          az containerapp update --name "$APP_NAME" \
            --resource-group "$RG" \
            --image "$IMAGE_REF"
      - name: Smoke-test deployment
        # Step-level env does not carry over from the previous step.
        env:
          APP_NAME: ${{ secrets.AZURE_CONTAINERAPP_NAME }}
          RG: ${{ secrets.AZURE_RESOURCE_GROUP }}
        run: |
          URL=$(az containerapp show \
            --name "$APP_NAME" --resource-group "$RG" \
            --query properties.configuration.ingress.fqdn -o tsv)
          for i in 1 2 3 4 5; do
            if curl -fsS "https://${URL}/health"; then echo OK && exit 0; fi
            sleep 10
          done
          exit 1
Hitting the endpoint
curl -X POST \
https://forest-cover-endpoint.whiteflower-743727a8.northeurope.azurecontainerapps.io/predict \
-H 'content-type: application/json' \
-d '{
"elevation": 2596,
"aspect": 51,
"slope": 3,
"horizontal_distance_to_hydrology": 258,
"vertical_distance_to_hydrology": 0,
"horizontal_distance_to_roadways": 510,
"hillshade_9am": 221,
"hillshade_noon": 232,
"hillshade_3pm": 148,
"horizontal_distance_to_fire_points": 6279,
"wilderness_area": "Rawah",
"soil_type": 29
}'
Returns something like:
{
"cover_type": 2,
"cover_label": "Lodgepole Pine",
"probabilities": {
"Spruce/Fir": 0.025,
"Lodgepole Pine": 0.66,
"Aspen": 0.315,
"Ponderosa Pine": 0.0,
"Douglas-fir": 0.0,
"Cottonwood/Willow": 0.0,
"Krummholz": 0.0
},
"model_version": "20260618-194508"
}
The model_version is the UTC timestamp of when the model in this image was trained. Two different deploys will return different versions even if the predictions are identical, which is useful for debugging after a rollback.
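A version string in that shape is cheap to produce at training time. The sketch below assumes the format is `YYYYMMDD-HHMMSS`, inferred from the sample value above; the function names are hypothetical and the training script may build it differently.

```python
from datetime import datetime, timezone
from typing import Optional

# Format inferred from the sample value "20260618-194508" (an assumption).
VERSION_FORMAT = "%Y%m%d-%H%M%S"


def make_model_version(trained_at: Optional[datetime] = None) -> str:
    """Format a timezone-aware training time as a UTC version string."""
    moment = trained_at or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime(VERSION_FORMAT)


def parse_model_version(version: str) -> datetime:
    """Turn a version string back into a timezone-aware UTC datetime."""
    return datetime.strptime(version, VERSION_FORMAT).replace(tzinfo=timezone.utc)
```

A side benefit of this format is that plain string comparison sorts versions chronologically, so "which deploy is newer" never needs date parsing.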
The cost
Realistic expected monthly cost: €0–2 with the container scaling to zero. Worst case if scale-to-zero hiccups: maybe €30/month. There's a budget alert at €5 to catch anything weird before it becomes anything more interesting.
What I'd do differently next time
- Pin the Container Apps base image in the workflow rather than relying on azurecontainerapps-helloworld:latest for the placeholder. The placeholder is fine for a first deploy, but latest is mutable, which means the "before" state in a rollback isn't reproducible.
- Move training out of the deploy workflow. Right now every push retrains. That's fine while iterating, but in a real system training is its own thing, with its own cadence and its own artifact registry. The deploy workflow should just pick up the latest validated model.
- Add structured logging and a /metrics endpoint. Without observability I'm flying blind once the container starts. Even basic request counts and latency p99 would be a big upgrade.
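For the last item, the minimum viable version is small. Here is a sketch of an in-memory recorder for request counts and p99 latency using the nearest-rank method. The class name is hypothetical and none of this exists in the service yet; a real deployment would more likely use a metrics library and bounded storage.

```python
import math
from typing import List


class LatencyStats:
    """Keeps every observed latency in memory; fine for a low-traffic demo."""

    def __init__(self) -> None:
        self._latencies_ms: List[float] = []

    def record(self, latency_ms: float) -> None:
        """Store the latency of one handled request, in milliseconds."""
        self._latencies_ms.append(latency_ms)

    @property
    def request_count(self) -> int:
        return len(self._latencies_ms)

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile, e.g. pct=99 for p99."""
        if not self._latencies_ms:
            raise ValueError("no requests recorded yet")
        if not 0 < pct <= 100:
            raise ValueError("pct must be in (0, 100]")
        ordered = sorted(self._latencies_ms)
        # Nearest rank: the smallest value with at least pct% of samples at or below it.
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]
```

A /metrics handler would call record() around each prediction and return request_count and percentile(99) as JSON or in Prometheus text format.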
What this project unlocks
A handful of things I can now do end-to-end that I couldn't before:
- MLOps and model deployment. Train a scikit-learn classifier, package it in Docker, and roll it out to a managed cloud service via a CI gate — without manually copying artifacts or restarting servers.
- Microsoft Azure end-to-end. Container Apps, Log Analytics, role-scoped IAM, and Consumption budgets, provisioned through the Azure CLI and version-controlled in infra/README.md.
- Cloud migration patterns. Lifting a local-only artifact into a managed cloud service with reproducible deployments and a clean rollback path.
- Secret-free CI/CD. GitHub Actions federated to a Microsoft Entra ID (formerly Azure AD) app registration via OIDC, with no long-lived credentials stored anywhere.
The non-skill thing I'm taking away: the model is the easy part. Re-fitting the classifier takes up only part of each six-minute deploy. The rest of the work was getting the model to ship reliably. Most ML coursework treats the algorithm as the deliverable, which turns out to be backwards once a real user has to talk to it.
The code, including the infra/README.md walkthrough, is on GitHub:
github.com/maria-aguilera/forest-cover-endpoint.
