# Integration

This chapter details how each service in the catalogue is instrumented to feed the LGTM stack with metrics, logs, and traces.

## Overview of integrations

| Service | Metrics | Logs | Traces | Dashboard |
|---|---|---|---|---|
| Keycloak | Native /metrics endpoint | JSON logs on stdout | OTel Java SDK | Keycloak Overview |
| CoreDNS | Native /metrics endpoint | Structured logs | No (no DNS traces) | CoreDNS Overview |
| CI/CD (Gitea/ArgoCD) | /metrics endpoints | JSON logs | OTel spans (pipelines) | CI/CD Pipelines |
| Business applications | OTel SDK | Structured JSON logs | OTel SDK (auto-instrumentation) | Service Overview |
| Infrastructure (nodes) | node_exporter | systemd journal | No | Node Overview |
| Containers (Podman/K8s) | cAdvisor / kubelet | Container logs | No | Container Overview |

## Keycloak metrics

Keycloak exposes Prometheus metrics natively:

```yaml
# Keycloak configuration to expose metrics
# (environment variables or start-up arguments)
KC_METRICS_ENABLED: "true"
KC_HEALTH_ENABLED: "true"
```

```yaml
# Prometheus / Alloy scrape config
scrape_configs:
  - job_name: "keycloak"
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: false
      ca_file: /etc/prometheus/ca.pem
    static_configs:
      - targets: ["keycloak.entreprise:8443"]
    # Filtering on __name__ happens after the scrape, so it belongs in
    # metric_relabel_configs (not relabel_configs); keep every keycloak_*
    # series so the key metrics listed below are all retained
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "keycloak_.*"
        action: keep
```

Key metrics to watch:

| Metric | Type | Meaning |
|---|---|---|
| keycloak_logins_total | Counter | Total number of logins |
| keycloak_login_errors_total | Counter | Authentication failures |
| keycloak_registrations_total | Counter | Registrations |
| keycloak_request_duration_seconds | Histogram | Request latency |
| keycloak_active_sessions | Gauge | Active sessions |

An alerting expression built from these counters:

```promql
# Authentication failure ratio (alert if > 10%)
rate(keycloak_login_errors_total[5m])
/
rate(keycloak_logins_total[5m])
> 0.1
```
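
This expression can be packaged as a Prometheus alerting rule; a minimal sketch (the group name, alert name, duration, and labels are illustrative):

```yaml
groups:
  - name: keycloak  # illustrative group name
    rules:
      - alert: KeycloakHighLoginFailureRatio  # illustrative alert name
        expr: |
          rate(keycloak_login_errors_total[5m])
            / rate(keycloak_logins_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of Keycloak logins are failing"
```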

## CoreDNS metrics

CoreDNS exposes metrics through the prometheus plugin:

```
# Corefile: enable metrics
.:53 {
    prometheus :9153
    forward . 8.8.8.8 8.8.4.4
    cache 30
    log
    errors
}
```
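
To verify the setup, resolve a name through CoreDNS and read the metrics endpoint; a sketch assuming the server runs on the local host:

```bash
# Resolve a name through CoreDNS (listening on :53 as configured above)
dig @127.0.0.1 example.com +short

# Confirm the prometheus plugin answers on :9153
curl -s http://127.0.0.1:9153/metrics | grep coredns_dns_requests_total | head
```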

Key metrics:

| Metric | Type | Meaning |
|---|---|---|
| coredns_dns_requests_total | Counter | DNS requests by type (A, AAAA, CNAME) |
| coredns_dns_responses_total | Counter | Responses by code (NOERROR, NXDOMAIN, SERVFAIL) |
| coredns_dns_request_duration_seconds | Histogram | Resolution latency |
| coredns_cache_hits_total | Counter | DNS cache hits |
| coredns_cache_misses_total | Counter | DNS cache misses |

RED queries for CoreDNS:

```promql
# Rate
sum(rate(coredns_dns_requests_total[5m]))

# Errors
sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))

# Duration (p99)
histogram_quantile(0.99, sum by(le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))
```
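
The two cache counters from the table combine into a hit-ratio query; a sketch:

```promql
# Share of queries answered from cache
sum(rate(coredns_cache_hits_total[5m]))
/
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))
```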

## CI/CD metrics

### Gitea

```ini
# Enable metrics in app.ini
[metrics]
ENABLED = true
TOKEN = ${GITEA_METRICS_TOKEN}
```

```yaml
# Scrape config
- job_name: "gitea"
  metrics_path: /metrics
  bearer_token_file: /etc/prometheus/gitea-token
  static_configs:
    - targets: ["gitea.chaine-logicielle:3000"]
```

### ArgoCD

```yaml
# ArgoCD exposes metrics on three endpoints
- job_name: "argocd-server"
  static_configs:
    - targets: ["argocd-server-metrics.cicd:8083"]

- job_name: "argocd-repo-server"
  static_configs:
    - targets: ["argocd-repo-server.cicd:8084"]

- job_name: "argocd-app-controller"
  static_configs:
    - targets: ["argocd-application-controller-metrics.cicd:8082"]
```

Key CI/CD metrics:

| Metric | Meaning |
|---|---|
| argocd_app_sync_total | Number of synchronisations |
| argocd_app_health_status | Application health |
| gitea_builds_total | Number of builds |
| Pipeline duration | Duration of pipeline runs |
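
As an illustration, unhealthy applications can be surfaced from the controller metrics; a sketch (label names vary across ArgoCD versions, so check your /metrics output):

```promql
# Applications currently reporting a non-Healthy status
count by (name) (argocd_app_info{health_status!="Healthy"})
```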

## Structured application logs

### Standard JSON format

All services must emit JSON logs following a consistent schema:

```json
{
  "timestamp": "2026-04-16T14:23:45.123Z",
  "level": "info",
  "service": "catalog-api",
  "version": "1.4.2",
  "environment": "production",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "message": "request handled",
  "http": {
    "method": "GET",
    "path": "/api/v1/products",
    "status_code": 200,
    "duration_ms": 42,
    "client_ip": "10.20.1.45"
  }
}
```
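
A minimal sketch of producing this schema from a Python service with only the standard library (the service name and extra fields are illustrative; in practice trace_id/span_id would come from the active OpenTelemetry context):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records using the shared JSON schema."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "catalog-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Optional fields arrive through logging's `extra=` mechanism
        for key in ("trace_id", "span_id", "http"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info(
    "request handled",
    extra={"trace_id": "abc123def456", "http": {"method": "GET", "status_code": 200}},
)
```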

### Alloy configuration for log collection

```alloy
// Collect logs from Kubernetes containers
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.process.default.receiver]
}

// Log processing pipeline
loki.process "default" {
  // Parse JSON logs
  stage.json {
    expressions = {
      level     = "level",
      service   = "service",
      trace_id  = "trace_id",
      timestamp = "timestamp",
    }
  }

  // Attach labels
  stage.labels {
    values = {
      level   = "",
      service = "",
    }
  }

  // Redact PII (see the Confidentiality chapter); the captured value is
  // replaced, e.g. "password=hunter2" becomes "password=***REDACTED***"
  stage.replace {
    expression = "(?:password|token|secret)=(\\S+)"
    replace    = "***REDACTED***"
  }

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

### Loki labels

Keep the number of Loki labels low (< 15 per stream). High-cardinality values (user_id, request_id) must never become Loki labels; filter on them inside the log body instead, via `| json | user_id="xxx"`, as in the sketch below.
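
A LogQL query against the schema above (the service name comes from the JSON example; `"xxx"` is a placeholder):

```logql
{service="catalog-api"} | json | user_id="xxx"
```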

## Distributed traces

### Auto-instrumentation with OpenTelemetry

For a Java application (Spring Boot):

```dockerfile
# Dockerfile: add the OTel agent
FROM eclipse-temurin:21-jre
COPY app.jar /app.jar
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /otel-agent.jar
ENV JAVA_TOOL_OPTIONS="-javaagent:/otel-agent.jar"
ENV OTEL_SERVICE_NAME="catalog-api"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://alloy:4317"
ENV OTEL_TRACES_SAMPLER="parentbased_traceidratio"
ENV OTEL_TRACES_SAMPLER_ARG="0.1"
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

For a Python application (Flask/FastAPI):

```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Start with auto-instrumentation
OTEL_SERVICE_NAME=user-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py
```

For a Go application:

```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to the local Alloy instance
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("alloy:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    // Batch spans and sample 10% of new traces, honouring the parent decision
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
```
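
A sketch of how this might be wired into the service entry point (assumes the imports above plus "log"; flushing on shutdown avoids losing buffered spans):

```go
func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("tracer init: %v", err)
    }
    // Flush buffered spans before the process exits
    defer func() { _ = tp.Shutdown(ctx) }()

    tracer := otel.Tracer("catalog-api") // illustrative instrumentation name
    _, span := tracer.Start(ctx, "startup")
    span.End()
}
```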

## Grafana dashboards per service

### Grafana folder structure

```
Grafana Dashboards/
├── Infrastructure/
│   ├── Node Overview
│   ├── Container Overview
│   └── MinIO Overview
├── Production services/
│   ├── Keycloak Overview
│   ├── CoreDNS Overview
│   └── Observability (meta-monitoring)
├── Software delivery chain/
│   ├── Gitea Overview
│   ├── ArgoCD Overview
│   └── CI/CD Pipelines
└── Applications/
    ├── Service Overview (template)
    └── [One per business application]
```

### "Service Overview" dashboard template

Every service must have a dashboard covering the golden signals:

| Panel | PromQL query | Type |
|---|---|---|
| Request rate | sum(rate(http_requests_total{service="$service"}[5m])) | Graph |
| Error rate | sum(rate(http_requests_total{service="$service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$service"}[5m])) | Gauge |
| Latency p99 | histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m]))) | Graph |
| Saturation | process_resident_memory_bytes{service="$service"} / container_spec_memory_limit_bytes{service="$service"} | Gauge |
| Recent logs | {service="$service"} \| json \| level=~"error\|warn" | Logs panel |
| Active traces | Link to Tempo filtered on service.name="$service" | Link |

### Dashboards as code

Grafana dashboards must be versioned in Git (as JSON exports) and provisioned automatically through Kubernetes ConfigMaps or Grafana provisioning. Never create a dashboard only through the UI.
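
A minimal sketch of a file-based provisioning provider (the provider name, folder, and path are illustrative; in Kubernetes the same YAML typically ships in a ConfigMap mounted under /etc/grafana/provisioning/dashboards):

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: "default"
    folder: "Applications"  # illustrative target folder
    type: file
    options:
      path: /var/lib/grafana/dashboards  # where the exported JSON files live
```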