# Integration

This chapter details how each service in the catalogue is instrumented to feed the LGTM stack with metrics, logs, and traces.

## Overview of integrations

| Service | Metrics | Logs | Traces | Dashboard |
|---|---|---|---|---|
| Keycloak | Native /metrics endpoint | JSON logs on stdout | OTel Java SDK | Keycloak Overview |
| CoreDNS | Native /metrics endpoint | Structured logs | No (no DNS traces) | CoreDNS Overview |
| CI/CD (Gitea/ArgoCD) | /metrics endpoints | JSON logs | OTel spans (pipelines) | CI/CD Pipelines |
| Business applications | OTel SDK | Structured JSON logs | OTel SDK (auto-instrumentation) | Service Overview |
| Infrastructure (nodes) | node_exporter | systemd journal | No | Node Overview |
| Containers (Podman/K8s) | cAdvisor / kubelet | Container logs | No | Container Overview |

## Keycloak metrics

Keycloak exposes Prometheus metrics natively:

```yaml
# Keycloak configuration to expose metrics
# (environment variables or start-up arguments)
KC_METRICS_ENABLED: "true"
KC_HEALTH_ENABLED: "true"
```

```yaml
# Prometheus / Alloy scrape config
scrape_configs:
  - job_name: "keycloak"
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: false
      ca_file: /etc/prometheus/ca.pem
    static_configs:
      - targets: ["keycloak.entreprise:8443"]
    # Filtering on __name__ happens after the scrape, so it belongs in
    # metric_relabel_configs (not relabel_configs); keep every keycloak_*
    # series so the key metrics listed below are all retained
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "keycloak_.*"
        action: keep
```

Key metrics to watch:

| Metric | Type | Meaning |
|---|---|---|
| keycloak_logins_total | Counter | Total number of logins |
| keycloak_login_errors_total | Counter | Authentication failures |
| keycloak_registrations_total | Counter | Registrations |
| keycloak_request_duration_seconds | Histogram | Request latency |
| keycloak_active_sessions | Gauge | Active sessions |

An alerting expression built from these counters:

```promql
# Authentication failure ratio (alert if > 10%)
rate(keycloak_login_errors_total[5m])
/
rate(keycloak_logins_total[5m])
> 0.1
```
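
This expression can be packaged as a Prometheus alerting rule; a minimal sketch (the group name, alert name, duration, and labels are illustrative):

```yaml
groups:
  - name: keycloak  # illustrative group name
    rules:
      - alert: KeycloakHighLoginFailureRatio  # illustrative alert name
        expr: |
          rate(keycloak_login_errors_total[5m])
            / rate(keycloak_logins_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of Keycloak logins are failing"
```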

## CoreDNS metrics

CoreDNS exposes metrics through the prometheus plugin:

```
# Corefile: enable metrics
.:53 {
    prometheus :9153
    forward . 8.8.8.8 8.8.4.4
    cache 30
    log
    errors
}
```
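
To verify the setup, resolve a name through CoreDNS and read the metrics endpoint; a sketch assuming the server runs on the local host:

```bash
# Resolve a name through CoreDNS (listening on :53 as configured above)
dig @127.0.0.1 example.com +short

# Confirm the prometheus plugin answers on :9153
curl -s http://127.0.0.1:9153/metrics | grep coredns_dns_requests_total | head
```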

Key metrics:

| Metric | Type | Meaning |
|---|---|---|
| coredns_dns_requests_total | Counter | DNS requests by type (A, AAAA, CNAME) |
| coredns_dns_responses_total | Counter | Responses by code (NOERROR, NXDOMAIN, SERVFAIL) |
| coredns_dns_request_duration_seconds | Histogram | Resolution latency |
| coredns_cache_hits_total | Counter | DNS cache hits |
| coredns_cache_misses_total | Counter | DNS cache misses |

RED queries for CoreDNS:

```promql
# Rate
sum(rate(coredns_dns_requests_total[5m]))

# Errors
sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))

# Duration (p99)
histogram_quantile(0.99, sum by(le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))
```
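
The two cache counters from the table combine into a hit-ratio query; a sketch:

```promql
# Share of queries answered from cache
sum(rate(coredns_cache_hits_total[5m]))
/
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))
```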

## CI/CD metrics

### Gitea

```ini
# Enable metrics in app.ini
[metrics]
ENABLED = true
TOKEN = ${GITEA_METRICS_TOKEN}
```

```yaml
# Scrape config
- job_name: "gitea"
  metrics_path: /metrics
  bearer_token_file: /etc/prometheus/gitea-token
  static_configs:
    - targets: ["gitea.chaine-logicielle:3000"]
```

### ArgoCD

```yaml
# ArgoCD exposes metrics on three endpoints
- job_name: "argocd-server"
  static_configs:
    - targets: ["argocd-server-metrics.cicd:8083"]

- job_name: "argocd-repo-server"
  static_configs:
    - targets: ["argocd-repo-server.cicd:8084"]

- job_name: "argocd-app-controller"
  static_configs:
    - targets: ["argocd-application-controller-metrics.cicd:8082"]
```

Key CI/CD metrics:

| Metric | Meaning |
|---|---|
| argocd_app_sync_total | Number of synchronisations |
| argocd_app_health_status | Application health |
| gitea_builds_total | Number of builds |
| Pipeline duration | Duration of pipeline runs |
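
As an illustration, unhealthy applications can be surfaced from the controller metrics; a sketch (label names vary across ArgoCD versions, so check your /metrics output):

```promql
# Applications currently reporting a non-Healthy status
count by (name) (argocd_app_info{health_status!="Healthy"})
```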

## Structured application logs

### Standard JSON format

All services must emit JSON logs following a consistent schema:

```json
{
  "timestamp": "2026-04-16T14:23:45.123Z",
  "level": "info",
  "service": "catalog-api",
  "version": "1.4.2",
  "environment": "production",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "message": "request handled",
  "http": {
    "method": "GET",
    "path": "/api/v1/products",
    "status_code": 200,
    "duration_ms": 42,
    "client_ip": "10.20.1.45"
  }
}
```
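
A minimal sketch of producing this schema from a Python service with only the standard library (the service name and extra fields are illustrative; in practice trace_id/span_id would come from the active OpenTelemetry context):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records using the shared JSON schema."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "catalog-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Optional fields arrive through logging's `extra=` mechanism
        for key in ("trace_id", "span_id", "http"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info(
    "request handled",
    extra={"trace_id": "abc123def456", "http": {"method": "GET", "status_code": 200}},
)
```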

### Alloy configuration for log collection

```alloy
// Collect logs from Kubernetes containers
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.process.default.receiver]
}

// Log processing pipeline
loki.process "default" {
  // Parse JSON logs
  stage.json {
    expressions = {
      level     = "level",
      service   = "service",
      trace_id  = "trace_id",
      timestamp = "timestamp",
    }
  }

  // Attach labels
  stage.labels {
    values = {
      level   = "",
      service = "",
    }
  }

  // Redact PII (see the Confidentiality chapter); the captured value is
  // replaced, e.g. "password=hunter2" becomes "password=***REDACTED***"
  stage.replace {
    expression = "(?:password|token|secret)=(\\S+)"
    replace    = "***REDACTED***"
  }

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

### Loki labels

Keep the number of Loki labels low (< 15 per stream). High-cardinality values (user_id, request_id) must never become Loki labels; filter on them inside the log body instead, via `| json | user_id="xxx"`, as in the sketch below.
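
A LogQL query against the schema above (the service name comes from the JSON example; `"xxx"` is a placeholder):

```logql
{service="catalog-api"} | json | user_id="xxx"
```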

## Distributed traces

### Auto-instrumentation with OpenTelemetry

For a Java application (Spring Boot):

```dockerfile
# Dockerfile: add the OTel agent
FROM eclipse-temurin:21-jre
COPY app.jar /app.jar
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /otel-agent.jar
ENV JAVA_TOOL_OPTIONS="-javaagent:/otel-agent.jar"
ENV OTEL_SERVICE_NAME="catalog-api"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://alloy:4317"
ENV OTEL_TRACES_SAMPLER="parentbased_traceidratio"
ENV OTEL_TRACES_SAMPLER_ARG="0.1"
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

For a Python application (Flask/FastAPI):

```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Start with auto-instrumentation
OTEL_SERVICE_NAME=user-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py
```

For a Go application:

```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to the local Alloy instance
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("alloy:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    // Batch spans and sample 10% of new traces, honouring the parent decision
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
```
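
A sketch of how this might be wired into the service entry point (assumes the imports above plus "log"; flushing on shutdown avoids losing buffered spans):

```go
func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("tracer init: %v", err)
    }
    // Flush buffered spans before the process exits
    defer func() { _ = tp.Shutdown(ctx) }()

    tracer := otel.Tracer("catalog-api") // illustrative instrumentation name
    _, span := tracer.Start(ctx, "startup")
    span.End()
}
```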

## Grafana dashboards per service

### Grafana folder structure

```
Grafana Dashboards/
├── Infrastructure/
│   ├── Node Overview
│   ├── Container Overview
│   └── MinIO Overview
├── Production services/
│   ├── Keycloak Overview
│   ├── CoreDNS Overview
│   └── Observability (meta-monitoring)
├── Software delivery chain/
│   ├── Gitea Overview
│   ├── ArgoCD Overview
│   └── CI/CD Pipelines
└── Applications/
    ├── Service Overview (template)
    └── [One per business application]
```

### "Service Overview" dashboard template

Every service must have a dashboard covering the golden signals:

| Panel | PromQL query | Type |
|---|---|---|
| Request rate | sum(rate(http_requests_total{service="$service"}[5m])) | Graph |
| Error rate | sum(rate(http_requests_total{service="$service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$service"}[5m])) | Gauge |
| Latency p99 | histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m]))) | Graph |
| Saturation | process_resident_memory_bytes{service="$service"} / container_spec_memory_limit_bytes{service="$service"} | Gauge |
| Recent logs | {service="$service"} \| json \| level=~"error\|warn" | Logs panel |
| Active traces | Link to Tempo filtered on service.name="$service" | Link |

### Dashboards as code

Grafana dashboards must be versioned in Git (as JSON exports) and provisioned automatically through Kubernetes ConfigMaps or Grafana provisioning. Never create a dashboard only through the UI.
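
A minimal sketch of a file-based provisioning provider (the provider name, folder, and path are illustrative; in Kubernetes the same YAML typically ships in a ConfigMap mounted under /etc/grafana/provisioning/dashboards):

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: "default"
    folder: "Applications"  # illustrative target folder
    type: file
    options:
      path: /var/lib/grafana/dashboards  # where the exported JSON files live
```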