Real Time Backtesting Guide – Alloy

Backtesting lets you test workflow changes using real historical data, so you can see how a rule update or threshold change would have performed before making it active. All results are completely separate from your production environment and excluded from live queues, analytics, and billing.

Why It’s Valuable

Policy changes carry risk. Backtesting removes the guesswork by giving you a data-backed picture of how a change would have impacted real historical decisions before anything goes live.

Confidence before go-live: Validate new workflow logic against real historical data before setting it active
Safe experimentation: All results are informational and fully excluded from production analytics and billing. There's no risk in testing
Continuous optimization: Measure how rule or threshold changes affect outcomes and alert volumes without waiting for new data to come in
Audit trail: Maintain a clear, reviewable record of testing before deploying rule changes

When to Use Backtesting

During implementation: If you've backfilled historical data, you can run it through a new workflow to confirm it behaves as expected before go-live. Use it to verify rule logic, validate tags, and establish a performance baseline to fine-tune against.
For ongoing policy optimization: Once you're live, use backtesting any time you're considering a rule change: new rules, adjusted thresholds, or updated logic. Compare how outcomes would have shifted on historical data before committing to the change.

How to Use Backtesting

Start a new test
- Open the Workflow Editor or the Workflow Versions page, click the “Test” button to enter the Testing Suite, and select the “Backtest” option
- If you have unsaved edits, you'll be prompted to save them as a new version first. You can add a name and notes at that point
Choose your sample
- Compare to Another Version (recommended starting point) — Select an existing workflow version as your control. The same events or entities that ran through the control version will be reprocessed through your new version, giving you a clean apples-to-apples comparison. Click "View Changes" to see exactly what differs between the two versions
- Rerun a Past Test Sample — Reruns the exact same data from a previous test. Useful when you want to reproduce earlier results or compare multiple variations against the same dataset. Sample size can't be changed with this option
- Filter to a Custom Date Range (Events API clients only) — Define a custom population by date range. Particularly useful during implementation if you've backfilled historical data, or when you want to focus on a specific time period like a known fraud spike. If you don't see this option, your integration type doesn't support it
- Specify by Evaluation Token (Onboarding workflows) — Input a list of specific evaluation tokens to define your test population. Best for functional testing, for example confirming a rule fires on a specific event or entity you expect it to catch

Set your sample size
- The screen shows the total number of eligible events or entities
- We recommend a minimum of 500 for meaningful results; the maximum is 5,000
- Selections are made randomly from the eligible pool
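
To make the sampling rules above concrete, here is a minimal Python sketch of random selection from an eligible pool. It is illustrative only and is not Alloy's implementation: the function name `pick_sample`, the `seed` parameter, and the choice to return a "below recommended" flag rather than block the test are all assumptions. Only the 500 recommended minimum and the 5,000 maximum come from this guide.

```python
import random

MIN_RECOMMENDED = 500  # recommended minimum from this guide
MAX_SAMPLE = 5_000     # maximum from this guide


def pick_sample(eligible_ids, requested_size, seed=None):
    """Randomly pick a sample from the eligible pool (hypothetical helper)."""
    if requested_size < 1:
        raise ValueError("sample size must be at least 1")
    if requested_size > MAX_SAMPLE:
        raise ValueError(f"sample size cannot exceed {MAX_SAMPLE}")

    pool = list(eligible_ids)
    # A sample can never be larger than the eligible pool itself.
    size = min(requested_size, len(pool))

    # Sampling without replacement: each event or entity appears at most once.
    rng = random.Random(seed)
    sample = rng.sample(pool, size)

    # Flag (rather than reject) samples under the recommended minimum.
    below_recommended = size < MIN_RECOMMENDED
    return sample, below_recommended


pool = [f"evt_{i}" for i in range(12_000)]  # made-up IDs
sample, below_recommended = pick_sample(pool, 1_000, seed=7)
print(len(sample), below_recommended)
```

A fixed seed is used here only so the sketch is repeatable. In the product, reusing the same data is what the "Rerun a Past Test Sample" option is for.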
Confirm and run
- Review your control vs. test version, sample size, and version differences
- You can name your test and add notes about what you are testing, then click Run Test
- You can safely close the page and come back later by navigating to “Test History” for the workflow you are testing. Most tests complete within an hour, though some may take longer
- You'll be taken to your results automatically when it's done

Review your results
Your results page has two sections.
- Summary Charts show outcome distributions for both the control and test versions side by side, with percent change indicators for each category (e.g., Dismissed, Suspicious, Denied). If any evaluations came back partial, you'll see a count and warning banner. Partials are excluded from the charts so they don't skew your numbers.
- Detailed Results Table shows an evaluation-by-evaluation comparison, including the Transaction ID or Entity, plus the original (control) and test outcomes and tags for each. You can filter by:

Entity ID or Name
Whether the outcome changed (Yes/No)
Specific outcome differences
Tag differences
Test evaluation status (Complete/Partial)

Click into any individual evaluation, then click on a tag to open Rule Explainability and see exactly which rules fired and why. This is the fastest way to validate that your changes are behaving as expected, or to investigate unexpected outcome shifts. Results are read-only. Since evaluations are informational, they can't be assigned, actioned, or moved to a live queue.
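
If you want to reason about the numbers on the results page, the sketch below shows one way the summary figures and table filters could be computed. It is an assumption-laden illustration, not Alloy's code or export format: the record fields (`status`, `control_outcome`, `test_outcome`) and both function names are hypothetical, and "percent change" is assumed to mean the relative change in each outcome's count between the control and test versions. The one rule taken from this guide is that partial evaluations are counted separately and left out of the distributions.

```python
from collections import Counter


def summarize(evaluations):
    """Per-outcome counts and percent change, excluding partials (hypothetical)."""
    complete = [e for e in evaluations if e["status"] == "complete"]
    partial_count = len(evaluations) - len(complete)

    control = Counter(e["control_outcome"] for e in complete)
    test = Counter(e["test_outcome"] for e in complete)

    rows = {}
    for outcome in sorted(set(control) | set(test)):
        c, t = control[outcome], test[outcome]
        # Relative change is undefined when the control count is zero.
        pct_change = (t - c) / c * 100 if c else None
        rows[outcome] = {"control": c, "test": t, "pct_change": pct_change}
    return rows, partial_count


def filter_rows(evaluations, changed=None, status=None):
    """Mimic two of the table filters: outcome changed (True/False) and status."""
    result = []
    for e in evaluations:
        is_changed = e["control_outcome"] != e["test_outcome"]
        if changed is not None and is_changed != changed:
            continue
        if status is not None and e["status"] != status:
            continue
        result.append(e)
    return result


# Made-up example data.
evaluations = [
    {"id": "a", "status": "complete", "control_outcome": "Dismissed", "test_outcome": "Dismissed"},
    {"id": "b", "status": "complete", "control_outcome": "Dismissed", "test_outcome": "Suspicious"},
    {"id": "c", "status": "complete", "control_outcome": "Suspicious", "test_outcome": "Suspicious"},
    {"id": "d", "status": "complete", "control_outcome": "Denied", "test_outcome": "Denied"},
    {"id": "e", "status": "partial", "control_outcome": "Dismissed", "test_outcome": "Denied"},
]

rows, partial_count = summarize(evaluations)
for outcome, row in rows.items():
    print(outcome, row)
print("partials:", partial_count)
```

In this made-up sample, the single partial is counted but never enters the per-outcome totals, which is why it cannot skew the comparison.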

A Note on Third-Party Services

If your workflow includes third-party services (like identity verification or device intelligence providers), backtesting uses cached historical responses from when those events were originally evaluated. It doesn't make new live calls to those services, which keeps your test results accurate to what actually happened at that point in time.

In some cases, if a cached response isn't available for a given event or entity, that evaluation will come back as a partial, meaning it completed but with incomplete data. Partials are flagged clearly in your results and excluded from the outcome and tag distribution charts so they don't skew your analysis.
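
The cached-response behavior described above can be pictured with a short Python sketch. This is a conceptual model only, under stated assumptions: the cache is represented as a plain dictionary keyed by `(event_id, service)`, and `replay_evaluation` and the service names are hypothetical, not part of any Alloy API. What it mirrors from this guide is that no live call is made and that a missing cached response produces a partial.

```python
def replay_evaluation(event_id, required_services, cache):
    """Rebuild an evaluation's inputs from cached responses only (hypothetical)."""
    responses = {}
    missing = []
    for service in required_services:
        cached = cache.get((event_id, service))
        if cached is None:
            # No live call is made: the gap is recorded instead.
            missing.append(service)
        else:
            responses[service] = cached

    status = "partial" if missing else "complete"
    return {
        "event_id": event_id,
        "status": status,
        "responses": responses,
        "missing": missing,
    }


# Made-up cache: evt_2 has no stored device intelligence response.
cache = {
    ("evt_1", "identity_verification"): {"result": "match"},
    ("evt_1", "device_intelligence"): {"risk": "low"},
    ("evt_2", "identity_verification"): {"result": "match"},
}
services = ["identity_verification", "device_intelligence"]

first = replay_evaluation("evt_1", services, cache)
second = replay_evaluation("evt_2", services, cache)
print(first["status"], second["status"], second["missing"])
```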

A small number of services are not compatible with backtesting and will prevent the Backtest option from being available on that workflow. If you run into this, reach out to your Alloy team for guidance.

Portfolio Evaluation (PE) Backtesting

If you use Portfolio Evaluation workflows, backtesting is available there too. Open any PE workflow and you'll see the same Backtest tile. The type of test to run is determined automatically from your workflow type.

Rather than replaying individual events, PE Backtesting takes a snapshot of your portfolio at a specific point in time and re-evaluates all entities through a new workflow version. This lets you measure policy impact across your entire portfolio before making changes live. The same informational guarantee applies: nothing touches your production environment.

FAQs

What's the difference between "control" and "test" versions? The control version is the existing workflow that originally processed your events or entities. The test version is the updated workflow you want to evaluate. Backtesting shows you what the test version would have decided on that same data, so you can see exactly what changed.

What does "informational" mean? Any output from a backtest, including evaluations, outcomes, and tags, is labeled "informational." This means it's automatically filtered out of your live queues, analytics dashboards, and billing. Nothing you do in a backtest affects production.

How long does a test take? The majority of tests complete in under an hour. A small number of configurations may take longer, in some cases 12 to 18 hours. You don't need to stay on the page. The system will take you to your results when it's done.

What does it mean if my test has "partial" evaluations? A partial means an evaluation completed with incomplete data, usually because a cached third-party service response wasn't available for that event or entity. Partials are flagged in your results and excluded from the summary charts so your distributions stay accurate.

I don't see the Backtest option on my workflow. Why? The most likely reason is that your workflow contains a third-party service that isn't compatible with backtesting. The Backtest tile will be disabled with an explanation. Reach out to your Alloy team if you'd like guidance on next steps.

We use Fraud Signal. How does that work in backtesting? If you're an existing Fraud Signal user, backtesting uses your historical cached Fraud Signal scores, matched to within 24 hours of the original event timestamp. If you're adding Fraud Signal for the first time, you'll need to generate some live evaluations before backtesting with it becomes available.
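
As a rough illustration of what "matched to within 24 hours" could look like, the Python sketch below picks a cached score for an event by timestamp. Several details are assumptions rather than documented behavior: that the window applies on both sides of the event time, that the closest score wins when more than one qualifies, and that no match means no score. The function name and data layout are hypothetical.

```python
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(hours=24)  # 24-hour window from this guide


def match_cached_score(event_time, cached_scores):
    """Return the cached score closest to event_time within the window, else None.

    cached_scores is a list of (timestamp, score) pairs (hypothetical layout).
    """
    candidates = [
        (abs(ts - event_time), score)
        for ts, score in cached_scores
        if abs(ts - event_time) <= MATCH_WINDOW
    ]
    if not candidates:
        return None
    # Assumption: when several scores qualify, the nearest in time is used.
    return min(candidates, key=lambda pair: pair[0])[1]


# Made-up timestamps and scores.
event_time = datetime(2024, 3, 10, 12, 0)
cached_scores = [
    (datetime(2024, 3, 8, 12, 0), 0.2),   # 48 hours away: outside the window
    (datetime(2024, 3, 10, 6, 0), 0.7),   # 6 hours away: closest match
    (datetime(2024, 3, 11, 11, 0), 0.4),  # 23 hours away: inside the window
]

matched = match_cached_score(event_time, cached_scores)
unmatched = match_cached_score(event_time, cached_scores[:1])
print(matched, unmatched)
```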

I don't have historical data yet. Can I still run a test? You'll need historical data before you can run a backtest. If you're in early implementation, reach out to your Alloy team about backfilling data via the Events API.

Can I act on backtest results? No. Backtest evaluations are read-only and for analysis only. They can't be actioned or moved into a live queue.