
A/B Testing Triggers: Counterfactual Logging Makes ML Algorithm Experiments Super Efficient!


In the previous article, we explored how trigger mechanisms solve the dilution problem in A/B testing, potentially cutting required sample sizes nearly in half. The theory sounds promising, but the next natural question is: “How exactly do I implement triggering?”

During my time at Google, I witnessed multiple teams implementing trigger mechanisms. What I learned there was specific to Google’s internal infrastructure: valuable for understanding the principles, but of limited practical use for most teams. So I broadened my view by studying how other companies approach the problem. This article synthesizes learnings from major tech companies including Spotify and DoorDash, introducing three levels of trigger methods:

  1. User-Level Triggering: The simplest approach using user attributes
  2. Exposure Logging: Moderate complexity tracking actual user interactions
  3. Counterfactual Logging: The most sophisticated method for ML algorithms (the focus of this article!)

Each method comes with its applicable scenarios, trade-offs, and real-world case studies.

If you’re unfamiliar with the foundational concepts of trigger mechanisms, I recommend reading the previous article first to understand what triggering is and why it matters.


Method 1: User-Level Triggering

Just like surveys screen respondents, we only include users who meet specific criteria in our A/B tests

Applicable Scenarios

User-Level Triggering is the simplest trigger method, suitable for:

  • Experiments affecting only specific user segments (e.g., specific countries, premium users, particular devices)
  • Cases where target users can be identified before the experiment begins
  • Teams with limited technical resources that can’t invest in complex logging systems

User-Level Triggering (Illustration by Kuan-Hao)

Approach

The core concept: Filter users at the experiment platform configuration stage. Only users meeting specified criteria get assigned to control or treatment groups.

Example configuration:

  • Target users: country = 'France' AND user_type = 'premium'
  • Only French premium members enter the experiment
  • Country and subscription type are known before experiment begins, enabling straightforward A/B assignment

graph TB
  subgraph "User-Level Triggering Flow"
    A[All Users] --> B{Meets User<br/>Attribute Criteria?}
    B -->|Yes| C[Include in Experiment]
    B -->|No| D[Exclude]
    C --> E[Random Assignment<br/>Control vs Treatment]
  end
  style C fill:#90EE90
  style D fill:#FFB6C1
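
To make this concrete, here is a minimal sketch of user-level triggering in Python. The field names and the hash-based 50/50 split are illustrative assumptions; in practice most experiment platforms let you express the filter purely as configuration, as described above.

```python
import hashlib
from typing import Optional

def is_eligible(user: dict) -> bool:
    # Mirrors the example configuration: country = 'France' AND user_type = 'premium'
    return user.get("country") == "France" and user.get("user_type") == "premium"

def assign_variant(user: dict, experiment_id: str) -> Optional[str]:
    """Only eligible users ever enter the experiment; everyone else returns None."""
    if not is_eligible(user):
        return None  # excluded before assignment, so their data never dilutes the analysis
    # Deterministic 50/50 split based on a hash of (experiment, user)
    bucket = int(hashlib.md5(f"{experiment_id}:{user['id']}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

print(assign_variant({"id": "u1", "country": "France", "user_type": "premium"}, "pricing_test"))
print(assign_variant({"id": "u2", "country": "Germany", "user_type": "premium"}, "pricing_test"))  # None
```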

Case Study: Netflix Pricing Strategy Test

(Hypothetical scenario) Suppose Netflix wants to test new pricing only in Brazil. They could use User-Level Triggering to target only that region. The setup would be straightforward:

Experiment condition: country = 'Brazil'
- Control group (50%): See current pricing $8.99
- Treatment group (50%): See new pricing $7.99

However, this method has obvious limitations. While all Brazilian users enter the experiment, not everyone actually “experiences” the pricing change:

  • Annual subscribers still under contract won’t see new prices
  • Family plan secondary accounts don’t interact with payment pages
  • Inactive users won’t even log in to see any changes

These Brazilian users, despite meeting User-Level Triggering criteria and being “in the experiment,” aren’t actually affected by the test. Their data merely dilutes the true pricing effect.

graph TB
  subgraph "Triggering by Country"
    F["Brazilian Users<br/>✓ Included"] --> G["Annual Subscribers<br/>✗ Don't See New Price"]
    F --> H["Family Plan Secondary Accounts<br/>✗ Don't Access Payment"]
    F --> I["Inactive Users<br/>✗ Never Login"]
    F --> J["Actually Impacted Users<br/>✓ See New Pricing"]
    G --> K["User-Level Triggering:<br/>Still Includes Unaffected Users"]
    H --> K
    I --> K
  end
  style J fill:#90EE90
  style G fill:#FFB6C1
  style H fill:#FFB6C1
  style I fill:#FFB6C1
  style K fill:#FFE6D1

Pros and Cons

Advantages:

  • Simplest implementation: Just configure filtering conditions in the experiment platform (no code changes required)
  • No additional logging needed: Leverages existing user attribute data
  • Sufficient for 80% of basic experiments: Most A/B tests don’t face severe dilution; user-level filtering suffices

Disadvantages:

  • Can’t handle dynamic behaviors: Cannot determine if users truly “saw” or “interacted with” experimental changes; still includes significant noise that dilutes effects
    • Solution: Method 2’s Exposure Logging
  • Unsuitable for algorithm experiments: Cannot determine if new vs. old algorithms produce different results
    • Solution: Method 3’s Counterfactual Logging

Method 2: Exposure Logging

Only start counting when users actually see the modified page

Applicable Scenarios

Exposure Logging enables more precise triggering than user-level filtering:

  • Experiments modifying specific pages or features
  • Need to confirm users “actually saw” experimental changes
  • Standard practice for most UI/UX experiments

Exposure Logging (Illustration by Kuan-Hao)

Approach

Core principle: Log as “exposed” only when users trigger specific behaviors; analyze only users with exposure logs.

Rather than filtering users before experiment start, this method determines inclusion during the experiment based on users’ actual behaviors.

For an e-commerce checkout page design A/B test, Exposure Logging works like this:

Step 1: Add logging code to critical pages
When user enters checkout page:
  Record exposure_event('checkout_page_viewed')

Step 2: Experiment runs normally
*All* users still get assigned to control or treatment

Step 3: Filter data for analysis
Analyze only users with exposure_event records

This approach works for both client-side (frontend) and server-side (backend) implementation, depending on your technical architecture.

graph LR
  subgraph "Exposure Logging Flow"
    A[User Enters Site] --> B[Assigned to Control/Treatment]
    B --> C{Visits<br/>Checkout Page?}
    C -->|Yes| D[Record exposure_event]
    C -->|No| E[No Record]
    D --> F[Include in Analysis]
    E --> G[Exclude from Analysis]
  end
  style D fill:#90EE90
  style E fill:#FFB6C1
  style F fill:#90EE90
  style G fill:#FFB6C1
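
Below is a minimal sketch of what the exposure log and the analysis-time filter could look like. The event name, logger, and data shapes are assumptions for illustration; the essential point is that the event fires exactly when the user reaches the modified page, and the analysis keeps only users who have such an event.

```python
import logging

logger = logging.getLogger("experiments")

def on_checkout_page_viewed(user_id: str, experiment_id: str, variant: str) -> None:
    """Fired from the checkout page (client- or server-side) for every experiment user."""
    logger.info(
        "exposure_event name=%s user=%s experiment=%s variant=%s",
        "checkout_page_viewed", user_id, experiment_id, variant,
    )

def filter_to_exposed(metric_rows: list, exposure_logs: list) -> list:
    """At analysis time, keep only metric rows for users with at least one exposure event."""
    exposed_users = {e["user_id"] for e in exposure_logs if e["name"] == "checkout_page_viewed"}
    return [row for row in metric_rows if row["user_id"] in exposed_users]
```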

Case Study: DoorDash Fraud Detection Experiment

In a technical blog post, DoorDash described testing a new fraud detection algorithm. When the algorithm identified suspicious users, it added a verification step at checkout requiring credit card validation to prevent fraud. This scenario faced a classic dilution problem: users flagged as suspicious were inherently rare.

Without Exposure Logging:

  • Tracked all users: 44.5 million
  • Experiment results: Not statistically significant
  • Problem: The vast majority of users weren’t flagged as suspicious, never experiencing the new credit card verification flow (their data was just noise)

With Exposure Logging:

  • Tracked only users requiring additional verification: 292,000
  • Experiment results: Statistically significant fraud rate reduction
  • Experiment sensitivity improved 160x (calculation detailed in the original article, Example 1)

This case perfectly demonstrates Exposure Logging’s advantage: by excluding users unlikely to be affected (~44 million were just noise!), the A/B test could detect the true effect with far greater sensitivity.
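
As a rough back-of-envelope check (not DoorDash’s actual calculation), the simple dilution model from the previous article says that if only triggered users can move the metric, the overall effect shrinks by the trigger rate, and required experiment time grows roughly with its inverse (assuming comparable variances):

```python
# Back-of-envelope check under the simple dilution model.
total_users = 44_500_000     # all users tracked without triggering
triggered_users = 292_000    # users who actually hit the verification flow

trigger_rate = triggered_users / total_users   # ≈ 0.66%
rough_speedup = 1 / trigger_rate               # ≈ 152x

print(f"trigger rate ≈ {trigger_rate:.2%}, rough speedup ≈ {rough_speedup:.0f}x")
```

This simple ratio lands in the same ballpark as the 160x figure reported by DoorDash; the exact number depends on the variance calculation detailed in their original post.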

Pros and Cons

Advantages:

  • More precise than User-Level (Method 1): More accurately tracks users potentially exposed to changes
  • Suitable for most UX experiments: Button colors, page layouts, feature modifications, etc.
  • Moderate technical complexity: Just requires adding logging code to key pages/features
  • Dramatic impact: As DoorDash’s case shows, can massively improve experiment sensitivity

Disadvantages:

  • Requires code modification: Must add new logging logic to relevant pages/features
  • May increase client-server communication: Client-side logging adds request/response overhead
  • Can’t handle “same results” situations: For algorithm experiments, Exposure Logging still can’t determine whether the new and old versions actually produce different results, the same gap noted for Method 1
    • Solution: Method 3’s Counterfactual Logging (detailed next)

Method 3: Counterfactual Logging

Only count it if new and old versions are genuinely “different”

Applicable Scenarios

Counterfactual Logging is the most sophisticated trigger method, designed specifically for:

  • Machine learning algorithm modifications
  • Search ranking algorithm tests
  • Recommendation system experiments
  • Any A/B test where new and old systems “produce identical results most of the time”

Essentially, any A/B test involving machine learning algorithms should consider Counterfactual Logging to dramatically increase experiment sensitivity.

Why do we need Counterfactual Logging? If you’ve ever practiced implementing ML algorithms in Python, you’ve likely experienced this: when optimizing ML models, even after tuning hyperparameters, the new model’s predictions often remain extremely similar to the old model’s.

Imagine testing a new recommendation algorithm. It might recommend identical products for 90% of users. For that 90%, the experiment effectively doesn’t happen: control and treatment see exactly the same content.

Another common scenario: even when the new algorithm recommends different content, those recommendations appear lower in the ranking where users never see them. In the YouTube example below, even if the A/B test’s algorithms produce different recommendations, users might not scroll down the page, never experiencing any difference between versions A and B!

Counterfactual Logging (Illustration by Kuan-Hao)

This is where Counterfactual Logging proves invaluable: users get “triggered” for analysis only when new and old algorithms produce different results. In our recommendation algorithm example, only the 10% of users receiving different product recommendations from new vs. old algorithms get triggered.

This advances beyond Method 2:

  • Exposure Logging only determines if users “might have” seen a feature
  • Counterfactual Logging precisely determines: even when accessing modified pages, whether users “actually experienced” differences between new and old versions

Approach: Dual-Track Parallel Execution

Counterfactual Logging requires the system to execute both versions in parallel (“dual-track”):

For each server request:

  1. Execute the version actually assigned to the user (display to user)
  2. Simultaneously execute the other version in the background (don’t display; invisible to user)
  3. Compare results from both versions
  4. If results differ → Mark as “triggered,” include in A/B test analysis
    • This embodies the counterfactual concept: the background version differs from the reality users see
  5. If results identical → Exclude from analysis

In some Google teams, this approach is called “Shadow Mode”, because one version runs like a “shadow” in the background, invisible to users.

graph TB
  subgraph "Counterfactual Logging: Shadow Mode Dual-Track"
    A[User Requests Recommendation] --> B{User Assignment}
    B --> M[Control]
    M -->|Actual| C[Run Old Algorithm<br/>Display Results]
    M -->|Counterfactual| E[Background Execution<br/>New Algorithm]
    C --> G{Compare Results<br/>Different?}
    E --> G
    B --> N[Treatment]
    N -->|Actual| D[Run New Algorithm<br/>Display Results]
    N -->|Counterfactual| F[Background Execution<br/>Old Algorithm]
    D --> H{Compare Results<br/>Different?}
    F --> H
    G -->|Yes| I[Triggered ✓<br/>Include in Analysis]
    G -->|No| J[Not Triggered ✗<br/>Exclude from Analysis]
    H -->|Yes| K[Triggered ✓<br/>Include in Analysis]
    H -->|No| L[Not Triggered ✗<br/>Exclude from Analysis]
  end
  style I fill:#90EE90
  style K fill:#90EE90
  style J fill:#FFB6C1
  style L fill:#FFB6C1
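
Here is a minimal sketch of the dual-track idea for a single request, using hypothetical names (serve_recommendations, old_algo, new_algo). A real system would typically run the counterfactual branch asynchronously and write to a proper logging pipeline rather than a Python logger; see the performance discussion later in this article.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("counterfactual")

def serve_recommendations(
    user_id: str,
    variant: str,  # "control" or "treatment", decided by the experiment platform
    old_algo: Callable[[str], Sequence[str]],
    new_algo: Callable[[str], Sequence[str]],
) -> Sequence[str]:
    # 1. Run the version the user is actually assigned to (this is what gets displayed).
    shown = old_algo(user_id) if variant == "control" else new_algo(user_id)

    # 2. Run the other version in the background; the user never sees this result.
    counterfactual = new_algo(user_id) if variant == "control" else old_algo(user_id)

    # 3. Compare both results and log whether this request is "triggered".
    triggered = list(shown) != list(counterfactual)
    logger.info("counterfactual_log user=%s variant=%s triggered=%s", user_id, variant, triggered)

    # 4. Only the assigned version's results are returned to the user.
    return shown
```

Running the counterfactual branch synchronously, as in this sketch, doubles the per-request compute; the performance strategies discussed later (asynchronous execution, sampling) address exactly that.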

Detailed Example: Recommendation System Experiment

Suppose you’re testing a new product recommendation algorithm:

User A (Control Group):

  • Old algorithm recommends: [Product 1, Product 2, Product 3]
  • New algorithm (background execution) would recommend: [Product 1, Product 4, Product 5]
  • Results differ → Triggered ✅
  • User A is in control, so actually sees old algorithm’s [Product 1, Product 2, Product 3]

User B (Control Group):

  • Old algorithm recommends: [Product 6, Product 7, Product 8]
  • New algorithm (background execution) also recommends: [Product 6, Product 7, Product 8]
  • Results identical → Not triggered ❌
  • User B’s data excluded from analysis, i.e., new vs. old makes no difference for them

User C (Treatment Group):

  • Old algorithm (background execution) recommends: [Product 9, Product 10, Product 11]
  • New algorithm recommends: [Product 9, Product 10, Product 42]
  • Results differ → Triggered ✅
  • User C is in treatment, so actually sees new algorithm’s [Product 9, Product 10, Product 42]

Data Scientist’s Analysis: Compare only behavior of User A and User C (both triggered), ignoring User B (not triggered).
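
As a quick sanity check in code, the same comparison applied to the three users above (data copied from the example) flags exactly A and C:

```python
old_recs = {"A": ["Product 1", "Product 2", "Product 3"],
            "B": ["Product 6", "Product 7", "Product 8"],
            "C": ["Product 9", "Product 10", "Product 11"]}
new_recs = {"A": ["Product 1", "Product 4", "Product 5"],
            "B": ["Product 6", "Product 7", "Product 8"],
            "C": ["Product 9", "Product 10", "Product 42"]}

triggered = [user for user in old_recs if old_recs[user] != new_recs[user]]
print(triggered)  # ['A', 'C'] -> only these users enter the analysis; B is excluded
```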

Case Study: Spotify’s Music Recommendation System

Spotify shared in their technical blog that improving their music recommendation algorithm ran into the same typical low-trigger-rate challenge: a new algorithm might produce different recommendations for only a small fraction of users.

Through Counterfactual Logging, Spotify could:

  • Identify users where new vs. old algorithms truly produced different recommendations (trigger rate 5-15%)
  • Focus analysis on these users’ behavioral differences
  • Detect subtler effect improvements in the same timeframe

According to industry research, for experiments with 5% trigger rates, Counterfactual Logging can reduce required experiment time by 20x.

Pros and Cons

Advantages:

  • Most precise triggering method: Includes only users truly experiencing differences
  • Most effective for low trigger rate experiments: Maximum benefit when trigger rate < 10%
  • Industry standard practice: Spotify, Booking.com, Netflix all use this for algorithm experiments

Disadvantages:

  • High computational cost
    • Each request requires dual-track execution of two algorithms (actual + counterfactual versions)
    • If new algorithm has high computational complexity, system load increases significantly
    • Must evaluate whether more precise experiment results justify this cost
  • High technical complexity
    • Dual-track algorithm execution obviously presents technical barriers (requires experiment platform support)
    • Needs robust logging systems to record counterfactual results
    • Most critically, must ensure counterfactual execution’s computational burden doesn’t degrade server performance or impact user experience
  • Cross-service coordination: In a microservices architecture, counterfactual logging may require coordination across multiple services; decisions made by upstream services must be passed to downstream ones, which calls for a unified experiment coordination mechanism

Counterfactual Logging is precise and powerful, but admittedly quite complex (ˊ_>ˋ)

For these challenges, Trustworthy Online Controlled Experiments discusses common industry performance optimization strategies:

  • Asynchronous processing: Move counterfactual computation off the critical path so it doesn’t affect user response time
  • Sampling strategies: Apply counterfactual logging to only a portion of traffic, not 100% (both strategies are sketched after this list)
  • A/A’/B experiments: Evaluate the “performance impact” of counterfactual logging itself
    • Group A runs no counterfactual computation, while A’ forces counterfactual computation
    • Users in groups A and A’ see identical results
    • Therefore, comparing A vs. A’ isolates the performance impact
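
Here is a rough sketch of the first two strategies (asynchronous processing and sampling), reusing the hypothetical handler idea from earlier. The 10% sample rate, thread pool, and function names are all illustrative assumptions, not a prescribed setup:

```python
import random
from concurrent.futures import ThreadPoolExecutor

COUNTERFACTUAL_SAMPLE_RATE = 0.10  # shadow-compute for only 10% of requests (assumed value)
executor = ThreadPoolExecutor(max_workers=4)

def log_counterfactual(user_id, variant, shown, old_algo, new_algo):
    """Runs off the critical path: compute the other version and record the trigger flag."""
    counterfactual = new_algo(user_id) if variant == "control" else old_algo(user_id)
    triggered = list(shown) != list(counterfactual)
    # write (user_id, variant, triggered) to your logging pipeline here

def serve_with_sampled_shadow(user_id, variant, old_algo, new_algo):
    # The user-facing response is computed and returned immediately.
    shown = old_algo(user_id) if variant == "control" else new_algo(user_id)

    # The counterfactual work is sampled and pushed to a background thread,
    # so it adds no latency to the request itself.
    if random.random() < COUNTERFACTUAL_SAMPLE_RATE:
        executor.submit(log_counterfactual, user_id, variant, shown, old_algo, new_algo)

    return shown
```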

Both implementation and optimization are complex, so product teams must carefully consider when to invest in Counterfactual Logging:

  • Severe dilution: Trigger rate < 10%
  • Business-critical experiment (e.g., core recommendation/search algorithms)
  • Team has sufficient engineering resources to build infrastructure

Which Trigger Method Should You Choose?

Here’s a method selection guide that also recaps everything covered in this article:

| Comparison | User-Level | Exposure Logging | Counterfactual Logging |
|---|---|---|---|
| Implementation Difficulty | ⭐ Simple | ⭐⭐ Moderate | ⭐⭐⭐⭐⭐⭐ Very Complex! |
| Precision | Low | Medium | Ultra-High! |
| Suitable Trigger Rate | Any | > 10% | < 10% Optimal |
| Performance Cost | None | Low | High (2x computation) |
| Code Modification Needed | No | Yes (Add logging) | Yes (Shadow Mode) |
| Typical Applications | Region/User segmentation | UI/UX modifications | Algorithm/ML experiments |

Trigger Method Decision Flow

What's your experiment type?
│
├─ User segmentation experiment (e.g., specific country/device)
│  → Use User-Level Triggering
│
├─ UI/UX modification (page design/buttons/flows)
│  │
│  ├─ Expected trigger rate > 20%
│  │  → Consider User-Level Triggering (simpler)
│  │
│  └─ Expected trigger rate < 20%
│     → Recommend Exposure Logging (more precise)
│
└─ Algorithm/Machine Learning experiment
   │
   ├─ Expected trigger rate > 10%
   │  → Exposure Logging suffices
   │
   └─ Expected trigger rate < 10%
      └─ Sufficient engineering resources?
         ├─ Yes → Counterfactual Logging (best choice)
         └─ No → Exposure Logging (compromise solution)
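
If it helps, the decision flow above can be written as a small helper function; the thresholds are the same rules of thumb used in the flow, not hard limits:

```python
def choose_trigger_method(experiment_type: str,
                          expected_trigger_rate: float,
                          has_engineering_resources: bool = True) -> str:
    """Mirrors the decision flow above; purely illustrative."""
    if experiment_type == "user_segmentation":
        return "User-Level Triggering"
    if experiment_type == "ui_ux":
        return "User-Level Triggering" if expected_trigger_rate > 0.20 else "Exposure Logging"
    if experiment_type == "algorithm":
        if expected_trigger_rate > 0.10:
            return "Exposure Logging"
        return "Counterfactual Logging" if has_engineering_resources else "Exposure Logging"
    raise ValueError(f"Unknown experiment type: {experiment_type}")

print(choose_trigger_method("algorithm", expected_trigger_rate=0.05))  # Counterfactual Logging
```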

Practical Recommendations

Synthesizing my Google experience with industry practices at Spotify, DoorDash, and elsewhere, here are practical recommendations for A/B testing trigger mechanisms:

Progressive Implementation of Triggering

  • Phase 1: Trigger based on User-Level attributes (e.g., selecting specific countries for experiments) as the most basic approach
  • Phase 2: Improve trigger precision starting with Exposure Logging
  • Phase 3: If ML algorithms affect < 10% of users, consider Counterfactual Logging (requires investing in complete “dual-track parallel execution” infrastructure)

Common Pitfalls ⚠️

  • Prematurely investing in complex Counterfactual Logging without infrastructure support
  • Neglecting performance monitoring, letting counterfactual logging drag down system performance
  • Poorly designed trigger conditions causing Sample Ratio Mismatch (SRM)
  • Blindly implementing without evaluating trigger rate (dilution severity), leading to poor ROI

Benefit Assessment Metrics

  • Trigger Rate: Actually impacted users / Total experiment users, evaluating dilution severity
  • Sample Size Reduction Ratio: Required samples with triggering vs. without
  • Experiment Sensitivity Improvement: Reduction in the Minimum Detectable Effect (MDE)
  • Return on Investment: Engineering cost vs. experiment efficiency gains

Conclusion

Everyone wants faster, more precise A/B tests. Before diving into fancy methods (CUPED, Bayesian approaches, etc.), triggering represents another excellent choice for reducing sample sizes and improving experiment sensitivity.

Trigger mechanisms aren’t a yes/no binary choice. They exist across different complexity levels. Select the most appropriate approach based on experiment context, trigger rate, and team resources:

  • User-Level Triggering: Simple and practical, ideal for clear user segmentation experiments
  • Exposure Logging: Moderate precision, suitable for most UI/UX experiments, the starting point for improving trigger accuracy
  • Counterfactual Logging: Most precise but most complex, the ultimate weapon for ML algorithm experiments

What matters most isn’t using the coolest, most complex technology, but choosing the most appropriate method for your current needs. Most teams start with simple Exposure Logging, then gradually upgrade to more sophisticated approaches as experience accumulates and needs grow.

References: