In a previous post, we dissected the anatomy of a runaway observability bill. You start with a simple goal—like enabling APM—and suddenly you’re paying for indexed spans, log ingestion, custom metrics, and a half-dozen other add-ons. The “single pane of glass” becomes your financial black hole.

The root cause of this problem is a common architectural anti-pattern: shipping all your telemetry data directly from your applications to your observability vendor.

This direct-to-vendor approach is simple to set up, which is why it’s so tempting. You install the vendor’s agent, configure your services, and data flows. But in doing so, you give up all control. You are at the mercy of the vendor’s pricing model, paying to ingest and index every verbose debug log, every redundant health check, and every high-volume, low-value trace.

This model is broken, but there is a better way. It’s time to stop being a passive consumer and start being an active architect of your observability strategy. The tool for this job is a powerful architectural pattern that puts you back in the driver’s seat: the observability pipeline.


What is an Observability Pipeline?

An observability pipeline is a dedicated layer of infrastructure that sits between your applications (the data sources) and your observability backends (the destinations). Its job is to collect, process, and route all telemetry data—logs, metrics, and traces—before it ever reaches a billable service.

Think of it as a central post office for your system’s data. Instead of every application mailing its letters directly, they all send them to the post office. There, you can sort them, discard the junk mail, bundle letters going to the same place, and choose the most cost-effective shipping method for each package.

This pipeline is typically built using vendor-neutral tools like the OpenTelemetry Collector or high-performance stream processors like Vector or Redpanda Connect. These tools are designed for high-throughput data processing and can be configured to perform powerful transformations on your data in-flight.

A simplified view: Applications send data to the Collector, which then processes and routes it to different destinations like Datadog for hot storage and S3 for cold storage.


The Superpowers of a Pipeline: How It Slashes Costs

Implementing a pipeline isn’t just about adding another box to your architecture diagram. It’s about unlocking capabilities that are impossible in a direct-to-vendor model.

1. Aggressive Filtering and Sampling

This is the most immediate and impactful cost-saving feature. Not all data is worth paying for.

  • Log Filtering: A service in a steady state might generate thousands of DEBUG or INFO logs per minute that are useless 99.9% of the time. With a pipeline, you can create rules to drop these logs at the source. For example: “If the log level is INFO and the service is recommendation-engine, drop it.” You stop paying to ingest and index noise (or route it to cheap storage if you still need it).
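
    A minimal sketch of such a rule, using the OpenTelemetry Collector’s filter processor (the service name is just an illustration):

    processors:
      filter/drop_noisy_info:
        logs:
          log_record:
            # Drop INFO-level logs coming from the recommendation-engine service
            - severity_text == "INFO" and resource.attributes["service.name"] == "recommendation-engine"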

  • Smart Trace Sampling: Do you need to store a detailed trace for every single successful 200 OK request? Absolutely not. A pipeline allows you to implement intelligent, tail-based sampling, where the decision to keep or drop a trace is made after all of its spans have been collected. This helps you keep the interesting stuff (errors, high-latency requests, rare paths) and throw away the boring, repetitive traces. It’s a compliance feature as much as a cost lever, as it reduces the volume of sensitive data leaving your network.

    Here is a conceptual example using an OpenTelemetry Collector’s tail_sampling processor:

    processors:
      tail_sampling/errors_and_slow:
        policies:
          # Keep 100% of traces that have an error
          - name: keep-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          # Keep traces that take longer than 2 seconds
          - name: keep-slow
            type: latency
            latency:
              threshold_ms: 2000
          # Keep 10% of all other traces
          - name: probabilistic-rest
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
    
  • Cost Quotas and Circuit Breakers: A developer accidentally introduces an infinite loop that logs an error on every iteration in a staging environment. Without a pipeline, this one mistake could generate millions of logs and cost thousands of dollars before anyone notices. A pipeline can act as a financial circuit breaker by enforcing quotas. You can set rules like: “The auth-service-dev can send a maximum of 10,000 log events per hour to Datadog. After that, drop all further logs from that service for the rest of the hour.” This prevents a single buggy service from blowing up your bill.
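
    Here’s a rough sketch of what such a quota could look like using Vector’s throttle transform (the input name and the service field are placeholders; the same idea can be built with other pipeline tools):

    transforms:
      quota_per_service:
        type: throttle
        inputs: ["app_logs"]        # upstream source or transform (placeholder)
        threshold: 10000            # max events allowed per window, per key
        window_secs: 3600           # one-hour window
        key_field: "{{ service }}"  # enforce the quota separately for each service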

You get full visibility into problems while dramatically reducing the volume of expensive indexed spans.

2. Smart Routing to Tiered Storage

Your most powerful cost-control lever is realizing that not all data needs to live in an expensive, instantly-queryable backend. A pipeline lets you route data based on its value.

  • The “Hot Path”: Critical data—error logs, traces from key user journeys, business transaction events—can be sent to your expensive, high-performance observability platform (like Datadog, Splunk, or New Relic). This is your “hot” storage for active incident response and real-time dashboards.
  • The “Cold Path”: High-volume, low-value data—debug logs, successful request logs, etc.—can be routed to cheap object storage like Amazon S3 or Google Cloud Storage. This data is still available for compliance or deep, offline analysis if needed (a process often called “rehydration”), but you’re not paying a premium to have it indexed 24/7.
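
Here’s a rough sketch of one way to build this with an OpenTelemetry Collector: the same stream of logs feeds two pipelines, a hot one that keeps only errors and sends them to Datadog, and a cold one that archives everything to S3 via the contrib awss3 exporter (the bucket, region, and API key are placeholders):

    receivers:
      otlp:
        protocols:
          grpc:

    processors:
      # Drop anything that isn't an error, so only errors reach the hot path
      filter/errors_only:
        logs:
          log_record:
            - severity_text != "ERROR"
      batch:

    exporters:
      datadog:
        api:
          key: ${env:DD_API_KEY}
      awss3:
        s3uploader:
          region: eu-west-1
          s3_bucket: my-telemetry-archive

    service:
      pipelines:
        # Hot path: only errors go to the expensive, indexed backend
        logs/hot:
          receivers: [otlp]
          processors: [filter/errors_only, batch]
          exporters: [datadog]
        # Cold path: everything lands in cheap object storage for later rehydration
        logs/cold:
          receivers: [otlp]
          processors: [batch]
          exporters: [awss3]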

This tiered approach can easily cut your logging and tracing bills by 50-90% without sacrificing the ability to investigate issues.

3. Enforce Privacy and Compliance by Design

A pipeline is the perfect place to clean and standardize your data, but its most critical role is as a security and compliance backstop. Every engineer has had that 3 a.m. moment: tailing logs during an incident and spotting a customer email, a bearer token, or other sensitive data.

  • PII Scrubbing and Redaction: In a direct-to-vendor model, that leaked PII is now in a third-party system, creating a security and compliance nightmare that violates regulations like GDPR. A pipeline allows you to enforce privacy by design. You can configure processors to detect and redact, hash, or delete sensitive data before it ever leaves your environment. This turns your pipeline into a non-negotiable policy enforcement point.

    Here’s a conceptual example of an OpenTelemetry Collector configuration that scrubs common PII from logs and traces:

    processors:
      # Processor to delete attributes with known PII keys
      attributes/pii_scrub:
        actions:
          - key: http.request.header.authorization
            action: delete
          - key: user.email
            action: delete
          - key: db.statement # SQL statements often contain user data
            action: delete
    
      # Processor to mask PII patterns in free-text log bodies
      transform/pii_mask:
        log_statements:
          - context: log
            statements:
              - replace_pattern(body, "(?i)(email|e-mail)\\s*[:=]\\s*[^,\\s]+", "email=<redacted>")
              - replace_pattern(body, "(?i)(token|authorization)\\s*[:=]\\s*[\\w\\.\\-]+", "token=<redacted>")
    

    This configuration acts as a safety net, ensuring that even if a developer accidentally logs a sensitive value, the pipeline prevents it from being exposed.

  • Managing Sensitive Identifiers: As we’ve discussed in a previous post, passing an end-user identifier is critical for debugging and abuse mitigation. While a UUID or internal database ID isn’t the same as an email address (PII), it’s still a sensitive identifier under regulations like GDPR. A pipeline is the perfect place to manage this risk. You can use it to transform a potentially linkable internal ID into a different, stable, and pseudonymous identifier before it reaches a third party, preserving your ability to correlate activity without exposing internal data structures.
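
    A minimal sketch of that idea, assuming the identifier arrives as an attribute called enduser.id, using the Collector’s attributes processor and its hash action:

    processors:
      attributes/pseudonymize:
        actions:
          # Replace the raw internal ID with a stable, pseudonymous hash before
          # it leaves your network ("enduser.id" is an assumption; use whatever
          # attribute your services actually emit)
          - key: enduser.id
            action: hash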

  • Centralized Enrichment: Instead of configuring every single application to add metadata like environment, region, or k8s_cluster, you can enrich the data as it flows through the pipeline. This simplifies application configuration and ensures consistent, standardized tagging across all your telemetry data.
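
    For example, a single resource processor in the pipeline can stamp every log, metric, and trace with consistent metadata (the values below are placeholders):

    processors:
      resource/enrich:
        attributes:
          - key: deployment.environment
            value: production
            action: upsert
          - key: cloud.region
            value: eu-west-1
            action: upsert
          - key: k8s.cluster.name
            value: prod-cluster-01
            action: upsert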

4. The Human Side: “But I Need All the Data During an Incident!”

This is the most common and valid pushback. When the pager goes off at 3 a.m., an engineer’s first instinct isn’t “I hope we’re managing our compliance risk effectively.” It’s “I need to see everything, right now, to fix this.” They’re not wrong. Reducing Mean Time to Resolution (MTTR) often depends on having rich, detailed context.

The solution isn’t to take away the tools needed to fight fires. It’s to build a system with different modes for “peacetime” and “wartime.” This is the “break-glass observability” playbook.

  • Peacetime (Default Mode): 99.9% of the time, the pipeline operates in a cost-optimized and privacy-preserving state. It aggressively samples traces, redacts PII, and routes verbose logs to cheap storage.
  • Wartime (Incident Mode): When a high-severity incident is declared, an authorized engineer can trigger a “break-glass” procedure. This temporarily adjusts the pipeline’s rules for a specific, targeted part of the system (e.g., for one service, or for traffic related to a specific user ID). This might involve increasing the trace sampling rate to 100% or disabling PII redaction for a short, fixed period.

Crucially, this action is temporary, targeted, and audited. The switch automatically reverts after a set time, the scope is limited to only what’s necessary, and a log is created of who enabled it and why. You get the speed and detail you need to solve the incident, without paying the financial and compliance cost of “collect everything” 24/7.

5. Breaking Vendor Lock-In

As we’ve discussed when choosing tools, not vendors, proprietary agents and SDKs create deep lock-in. An observability pipeline, especially when paired with OpenTelemetry, is the ultimate antidote.

When your applications are instrumented with the vendor-neutral OpenTelemetry SDKs, they send data to your pipeline (the OpenTelemetry Collector), not to a specific vendor. The pipeline then forwards that data to the vendor of your choice.

This gives you immense power:

  • Evaluate Competitors Easily: Want to see if a new vendor is more cost-effective? Simply configure your pipeline to dual-send a fraction of your data to the new vendor for a live, real-world comparison (see the sketch after this list). No need to re-instrument a single application.
  • Negotiate with Leverage: When your contract is up for renewal, your ability to switch vendors with a simple configuration change gives you a massive advantage in negotiations.
  • Use Multiple Backends: You can even send different types of data to different specialized tools. Maybe your traces go to Honeycomb, your logs go to a self-hosted Elasticsearch cluster, and your metrics go to Prometheus—all managed from one central pipeline.
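
To make the dual-send idea concrete, here’s a rough sketch of a Collector config that keeps sending all traces to your current vendor while mirroring a small sample to a candidate you’re evaluating (the candidate’s OTLP endpoint is a placeholder):

    receivers:
      otlp:
        protocols:
          grpc:

    processors:
      batch:
      # Mirror only a 5% sample of traces to the vendor under evaluation
      probabilistic_sampler/eval:
        sampling_percentage: 5

    exporters:
      datadog:
        api:
          key: ${env:DD_API_KEY}
      # OTLP endpoint of the candidate vendor (placeholder)
      otlp/candidate:
        endpoint: ingest.candidate-vendor.example:4317

    service:
      pipelines:
        traces/primary:
          receivers: [otlp]
          processors: [batch]
          exporters: [datadog]
        traces/evaluation:
          receivers: [otlp]
          processors: [probabilistic_sampler/eval, batch]
          exporters: [otlp/candidate]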

While this doesn’t offer the same convenience as a single, integrated solution, many teams find the flexibility and control to be a worthwhile trade-off.


Getting Started: A Pragmatic Approach

You don’t need to boil the ocean. Start small.

  1. Target a High-Volume Service: Identify a single, non-critical but high-volume service that is costing you a lot in log ingestion.
  2. Deploy a Collector: Deploy an OpenTelemetry Collector (or Vector) and configure that one service to send its logs to the Collector instead of directly to your vendor.
  3. Implement a Simple Rule: Configure your pipeline tool to forward the logs to your existing vendor, but add a single, high-impact rule, like dropping all DEBUG level logs or redacting email addresses.
  4. Measure the Impact: Watch your ingestion volume and your bill. The immediate savings from this one change will build the business case for expanding the pipeline to more services.
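
Putting steps 2 and 3 together, a minimal starter configuration might look something like this (it assumes your current vendor is Datadog and your service can emit OTLP; swap the exporter for whatever you use today):

    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      # The single high-impact rule: drop DEBUG logs before they are ever ingested
      filter/drop_debug:
        logs:
          log_record:
            - severity_text == "DEBUG"

    exporters:
      datadog:
        api:
          key: ${env:DD_API_KEY}

    service:
      pipelines:
        logs:
          receivers: [otlp]
          processors: [filter/drop_debug]
          exporters: [datadog]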

Conclusion: Own Your Data, Own Your Costs

The default model of shipping all your data directly to a vendor is a recipe for uncontrolled spending and vendor lock-in. It optimizes for the vendor’s business model, not yours.

An observability pipeline flips the script. It’s a strategic investment in your architecture that pays dividends in cost savings, security, and flexibility. By placing a smart, vendor-neutral layer between your services and your backends, you transform observability from a runaway operational expense into a well-governed, cost-effective capability.

Stop being a passive consumer of observability services. Own your data, control its flow, and start dictating the terms of your monitoring strategy.


✏️ Personal Notes

  • The idea of adding another piece of infrastructure can seem daunting, but modern tools like the OTel Collector are lightweight, highly performant, and designed for this exact purpose. The operational overhead is often far less than the cost savings it unlocks.
  • This pattern is becoming a standard for mature engineering organizations. If you’re operating at any significant scale, you’re likely already feeling the pain that an observability pipeline is designed to solve.
  • The beauty of this approach is that it’s incremental. You can start with one service and one rule, prove the value, and grow from there. It’s a journey, not a big-bang migration.