Building an Agentic AI System for Anomaly Detection: From Theory to Practice
- Rahul Ramanujam
- Oct 20
- 12 min read
A few weeks ago, I wrote about agentic AI in analytics and how autonomous systems are transforming how we work with data. The post resonated with many in the analytics community, but talking about concepts is one thing—actually building something is another.
So I decided to put theory into practice. I built an autonomous anomaly detection system for Google Analytics 4 that runs 24/7, monitors real GA4 data, and alerts me when something genuinely unusual happens. Not as a proof of concept, but as a production system that handles actual business data.
This post is about that journey: the decisions I made, the problems I encountered, and what I learned building a real agentic AI system.
The Challenge: Manual Monitoring Doesn't Scale
Like most people responsible for digital analytics, I found myself checking dashboards multiple times a day. The routine was familiar: open GA4, scan the key metrics, look for anything unusual, close the tab. Repeat a few hours later.
The problem is obvious: this doesn't scale. You can't watch every metric with the attention it deserves. By the time you spot something in your weekly review, the moment to act has often passed. And honestly, most checks reveal nothing unusual - it's just normal day-to-day variation.
What I needed was a system that could monitor continuously and alert me only when something actually warranted attention. The goal wasn't to eliminate human judgment, but to make it more efficient. Let the system handle the routine surveillance, and bring humans in when their expertise is actually needed.
Building the Foundation: Choosing a Data Source
The first major decision was how to access GA4 data. There are two main approaches, each with different trade-offs.
Option 1: GA4 Data API
The GA4 Data API is straightforward to set up. You enable it in Google Cloud Console, create a service account, grant it access to your GA4 property, and you're pulling data within 30 minutes.
The advantages are clear:
Quick setup with minimal infrastructure
Suitable for most monitoring use cases
Near real-time data access (approximately a 30-minute processing lag)
The limitations become apparent at scale:
Rate limits of 10 requests per second
Maximum 100,000 rows per request
Limited query flexibility compared to SQL
Costs can add up with high-volume monitoring
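For orientation, here's a minimal sketch of pulling daily sessions with the official Python client (google-analytics-data); the property ID is a placeholder and this is not the exact code behind my system:

```python
# Minimal sketch: pull daily sessions via the GA4 Data API Python client.
# The property ID is a placeholder; credentials come from a service account
# (picked up via the GOOGLE_APPLICATION_CREDENTIALS environment variable).
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="date")],
    metrics=[Metric(name="sessions")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="yesterday")],
)
response = client.run_report(request)

for row in response.rows:
    print(row.dimension_values[0].value, row.metric_values[0].value)
```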
Option 2: BigQuery Export
GA4's BigQuery export sends your raw event data to Google's data warehouse continuously. This is the more robust option, especially for production systems.
The benefits:
No API rate limits to worry about
Full SQL query capabilities
Can join with other data sources
More cost-effective at scale
Supports real-time streaming for GA4 360 customers
The trade-offs:
Requires enabling BigQuery export (24-hour delay for initial setup)
Slightly steeper learning curve if you're not familiar with SQL
Need to understand BigQuery's data structure
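For comparison, here's a minimal sketch of the same daily-sessions pull against the BigQuery export; the project and dataset names are placeholders, and the session count uses the common pattern of combining user_pseudo_id with ga_session_id:

```python
# Minimal sketch: daily sessions from the GA4 BigQuery export.
# `my-project.analytics_123456789` is a placeholder for your export dataset.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  event_date,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
GROUP BY event_date
ORDER BY event_date
"""

for row in client.query(query).result():
    print(row.event_date, row.sessions)
```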
For this project, I started with the GA4 Data API because I wanted to validate the concept quickly. It worked well for my needs, but I'd recommend BigQuery for production deployments at scale or if you need custom metrics and complex analysis.
Defining Success: Which Metrics Matter
Not all metrics deserve equal attention. I focused on four core metrics that tell the story of business health:
Sessions - Overall traffic trends
Active Users - Unique visitor engagement
Conversions - Critical actions taken
Revenue - Direct business impact
Each metric was assigned a business impact level (high or critical), which later influenced how alerts were prioritized and routed. The key insight here: monitor outcomes you care about, not vanity metrics that look good in presentations but don't drive decisions.
The Technical Challenge: What Is an Anomaly?
This is where things got interesting. Defining "anomaly" turns out to be harder than it sounds.
If sessions are 15% above average, is that unusual? What about 25%? 40%? The answer depends entirely on your data's natural variability. A 20% swing might be completely normal for one metric but alarming for another.
First Attempt: Statistical Detection with Z-Scores
I started with the classic statistical approach that you'll find in most data science textbooks: Z-score analysis.
The concept is elegant. Calculate the historical mean and standard deviation for each metric. Then, for each new data point, measure how many standard deviations away from the mean it falls. If it's beyond a certain threshold - say, 2.5 standard deviations - flag it as an anomaly.
I configured three sensitivity levels:
High (2.0σ): roughly 95% of normal variation falls within this band
Medium (2.5σ): about 98.8% of normal variation falls within this band
Low (3.0σ): around 99.7% of normal variation falls within this band
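For reference, the core of the Z-score check is only a few lines (a simplified sketch, not the full detector):

```python
import statistics

def zscore_is_anomaly(history: list[float], current: float,
                      sigma_threshold: float = 2.5) -> bool:
    """Flag `current` if it falls more than `sigma_threshold` standard
    deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs((current - mean) / stdev) >= sigma_threshold
```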
This should have worked beautifully. And for well-behaved, normally-distributed data with consistent variance, it does. But my data had other ideas.
The Reality: Digital Analytics Data Is Extremely Volatile
When I ran the system against actual GA4 data, I discovered something that fundamentally changed my approach: digital analytics data is far more volatile than traditional statistical methods can handle effectively.
Here's what I found in my data:
Sessions:
Standard deviation: 35,157 (25.7% of the mean)
A 21.5% increase from average generated a Z-score of only 0.84
With high sensitivity settings (2.0σ), I would need a 51% change to trigger an alert
Revenue:
Standard deviation: $190,946 (34.2% of the mean)
A 29.9% increase generated a Z-score of 0.88
Would need a 68% change to trigger an alert with high sensitivity
Think about that for a moment. A 30% revenue change - which is absolutely something you'd want to know about immediately - was being classified as "normal variation" by the statistical model.
Understanding the Volatility
Several factors contribute to this high day-to-day variance:
Weekend vs. weekday effects create dramatic traffic pattern differences
Campaign activity generates spikes when campaigns launch and drops when they pause
Seasonal patterns introduce week-to-week variation even within the same month
Natural randomness is inherent in how people interact with websites
When you calculate standard deviation across all days together, mixing weekends with weekdays and campaign periods with baseline periods, you end up with a large value that makes it nearly impossible to detect meaningful anomalies using purely statistical methods.
The Pivot: Percentage-Based Detection
After seeing Z-scores below 1.0 for changes I definitely wanted to know about, I realized the approach needed to change. The solution turned out to be simpler than the problem: percentage-based thresholds.
Instead of asking "Is this X standard deviations from the mean?" I started asking "Is this more than X percent different from the historical average?"
The math is straightforward:
Deviation % = ((Current Value - Historical Mean) / Historical Mean) × 100
If the absolute deviation exceeds a defined threshold, flag it as an anomaly.
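In code, the whole check reduces to a couple of lines; this sketch uses placeholder numbers that echo the 29.9% revenue change discussed earlier:

```python
# Sketch of the percentage-based check; the example values are placeholders.
def percent_deviation(current: float, baseline: float) -> float:
    """Signed deviation of `current` from `baseline`, in percent."""
    return (current - baseline) / baseline * 100

def is_anomaly(current: float, baseline: float, threshold_pct: float) -> bool:
    """Flag values whose absolute deviation exceeds the metric's threshold."""
    return abs(percent_deviation(current, baseline)) >= threshold_pct

print(is_anomaly(129_900, 100_000, 25))  # True: +29.9% exceeds a ±25% threshold
```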
Setting the Thresholds
I configured thresholds based on business context and observed data patterns:
Sessions & Active Users: ±20% threshold
Conversions & Revenue: ±25% threshold
These weren't arbitrary numbers. They came from asking: "What level of change would I actually want to be notified about?" Combined with observing typical variation patterns, these thresholds hit the sweet spot between catching real issues and avoiding alert fatigue.
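One convenient way to express this is a small configuration map that pairs each metric with its threshold and business impact level; the keys and labels here are illustrative, not a required schema:

```python
# Illustrative configuration mirroring the thresholds and impact levels above.
METRIC_CONFIG = {
    "sessions":     {"threshold_pct": 20, "impact": "high"},
    "active_users": {"threshold_pct": 20, "impact": "high"},
    "conversions":  {"threshold_pct": 25, "impact": "critical"},
    "revenue":      {"threshold_pct": 25, "impact": "critical"},
}
```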
Why This Works Better
Predictability. Telling someone "I'll alert you if revenue changes by more than 25%" requires no statistical background to understand. It's immediately clear what will and won't trigger an alert.
Business alignment. Stakeholders think in percentages, not standard deviations. A "25% revenue drop" is intuitively meaningful in a way "2.8 standard deviations below mean" simply isn't.
Tunability. If you're getting too many alerts, increase the threshold to 30%. Too few? Lower it to 20%. The relationship between the setting and the outcome is direct and obvious.
Robustness to volatility. High standard deviation doesn't matter anymore. A 22% change is a 22% change, regardless of how volatile your historical data has been.
The Results
With percentage-based detection in place, my October 6 data finally triggered appropriate alerts:
Metric | Actual Change | Threshold | Detection |
Sessions | +21.5% | ±20% | Anomaly |
Active Users | +18.6% | ±20% | Normal |
Conversions | +26.8% | ±25% | Anomaly |
Revenue | +29.9% | ±25% | Anomaly |
Three legitimate anomalies detected, one normal fluctuation correctly ignored. This is exactly what you want from an anomaly detection system.
Improving Accuracy: The Day-of-Week Solution
Even with percentage-based detection working well, there's a refinement that significantly improves accuracy: comparing like days to like days.
Monday traffic patterns are fundamentally different from Saturday patterns. Comparing Monday to the overall average (which includes low-traffic weekends) inflates variance and can trigger false positives.
The solution: compare Monday to previous Mondays, Tuesday to previous Tuesdays, and so on.
Implementation approach:
Fetch 90 days of historical data (about 13 instances of each day of the week)
When analyzing a Monday, calculate mean and threshold using only previous Mondays
Apply the same logic for each day of the week
This respects the natural weekly rhythm of your business. If your weekend traffic is consistently 40% lower than weekdays, the system won't incorrectly flag that as anomalous.
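Here's a sketch of the day-of-week baseline with pandas; the DataFrame layout and column names are assumptions for illustration, not a required schema:

```python
import pandas as pd

def day_of_week_baseline(history: pd.DataFrame, target_date: pd.Timestamp,
                         metric: str) -> float:
    """Average `metric` over earlier dates that share the target's weekday.

    Expects `history` to have a datetime `date` column and one column per metric.
    """
    same_weekday = history[
        (history["date"].dt.dayofweek == target_date.dayofweek)
        & (history["date"] < target_date)
    ]
    return same_weekday[metric].mean()
```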
The 90-Day Baseline Decision
How much historical data should inform your baseline? This turned out to be more important than I initially thought.
Too short (7-14 days): Baseline becomes too sensitive to recent fluctuations. A single unusual week can skew your entire baseline.
Too long (180+ days): You might miss gradual trends or seasonal shifts. Your baseline becomes stale.
The middle ground (60-90 days): Long enough to establish stable patterns, recent enough to reflect current business reality.
I settled on 90 days. It captures about three months of patterns, smoothing out weekly volatility while staying relevant to current conditions.
Handling Data Processing Delays
GA4 data isn't final immediately. It can take 24-48 hours for all events to be fully processed and reconciled. Check yesterday's numbers today and you might see 10,000 sessions; check again tomorrow and that same date might show 15,000 as late-arriving events are incorporated.
This creates a problem: if you analyze incomplete data, you'll get false anomalies from data that simply hasn't finished processing yet.
My solution: Analyze data from 2 days ago.
By the time you're looking at data from two days ago, GA4 has had 48 hours to process everything. The numbers are stable and won't change. Your comparisons are accurate.
The trade-off is clear: you're not getting real-time anomaly detection. But you are getting reliable anomaly detection for complete data. For most businesses, being notified today about an anomaly from two days ago is perfectly adequate - and far better than being alerted to a false anomaly caused by incomplete data.
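In code, picking the analysis date is a one-liner with the standard library:

```python
from datetime import date, timedelta

# The most recent day GA4 has had roughly 48 hours to finish processing.
analysis_date = date.today() - timedelta(days=2)
```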
Building Alert Intelligence: Severity Classification
Not every anomaly deserves the same response. A 21% increase in sessions is notable but probably doesn't need to wake up the CEO. A 45% revenue drop absolutely does.
I implemented a severity matrix that considers both the size of the deviation and the business impact of the metric:
For critical business impact metrics (Revenue, Conversions):
Deviation >40% = Critical severity
Deviation 25-40% = High severity
Deviation 15-25% = Medium severity
For high business impact metrics (Sessions, Users):
Deviation >50% = High severity
Deviation 30-50% = Medium severity
Deviation 20-30% = Low severity
This maps to alert routing:
Critical severity: Leadership Slack channel + email to executives
High/Medium severity: Analytics team Slack channel + email to team
Low severity: Logged but might not generate active alerts
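As a sketch, the severity matrix translates directly into a function; the metric names and labels are illustrative:

```python
def classify_severity(metric: str, deviation_pct: float) -> str | None:
    """Map an absolute % deviation to a severity level based on business impact."""
    dev = abs(deviation_pct)
    if metric.lower() in {"revenue", "conversions"}:  # critical business impact
        if dev > 40:
            return "critical"
        if dev > 25:
            return "high"
        if dev > 15:
            return "medium"
    else:  # high business impact metrics (sessions, users)
        if dev > 50:
            return "high"
        if dev > 30:
            return "medium"
        if dev > 20:
            return "low"
    return None  # below the lowest band: logged, but no severity assigned
```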
Creating Actionable Alerts
The difference between a useful alert and noise is context. When the system detects an anomaly, it needs to provide enough information that someone can understand what happened and decide what to do about it.
My alerts include:
Core metrics:
Current value and expected value
Percentage deviation and the threshold that triggered it
Date being analyzed
Severity level
Context:
Dimensional breakdown showing which channels, devices, or sources contributed most
Brief analysis suggesting where to look first
Historical trend showing the last several days for pattern recognition
Example email alert:
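Here's a hypothetical rendering of those fields as a plain-text message; the function, numbers, and segments are placeholders, not my actual template:

```python
# Hypothetical alert body; the metric, values, and segments are placeholders.
def format_alert(metric, current, expected, deviation_pct, threshold_pct,
                 severity, target_date, breakdown):
    lines = [
        f"[{severity.upper()}] {metric} anomaly on {target_date}",
        f"Current: {current:,.0f}   Expected: {expected:,.0f}",
        f"Deviation: {deviation_pct:+.1f}% (threshold: ±{threshold_pct}%)",
        "Top contributing segments:",
    ]
    lines += [f"  - {name}: {share:+.1f}%" for name, share in breakdown]
    return "\n".join(lines)

print(format_alert("Revenue", 129_900, 100_000, 29.9, 25, "high", "October 6",
                   [("mobile / organic search", 18.0),
                    ("desktop / paid search", 7.5)]))
```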

This gives whoever sees the alert enough information to either dismiss it as expected (maybe we just launched a sale) or investigate further with a clear starting point.
Dealing with False Positives
Even with good thresholds and day-of-week comparisons, false positives happen. The system includes several mechanisms to minimize them:
Minimum severity filtering: Only send alerts for medium severity and above during the trial period. This lets you validate the system without being overwhelmed.
Configurable thresholds per metric: Some metrics are naturally more volatile than others. Don't force them all to use the same threshold.
Historical context in alerts: Seeing the last 7 days of data helps you quickly recognize if something is genuinely unusual or just normal variation that happened to cross a threshold.
Iterative tuning: Track alert frequency and usefulness. If you're consistently dismissing alerts for a particular metric, increase its threshold.
My target is 2-4 meaningful alerts per week. More than that and I'm probably being too sensitive. Less than that and I might be missing important signals.
The System in Practice
The complete system runs as a scheduled task that executes daily:
Fetch data: Pull the last 90 days of metrics from GA4
Calculate baselines: Compute historical mean for each metric (optionally by day of week)
Detect anomalies: Compare the most recent complete day (2 days ago) against the baseline
Calculate deviation: Determine percentage difference from expected
Classify severity: Apply the severity matrix based on deviation and business impact
Gather context: For detected anomalies, fetch dimensional breakdowns
Generate alerts: Format and send email and/or Slack messages
Log everything: Keep records for future analysis and threshold tuning
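To make the flow concrete, here is a compressed, runnable sketch of that loop for a single metric, with randomly generated data standing in for the GA4 fetch; every name and number is illustrative:

```python
import random
from datetime import date, timedelta

def fetch_history(days: int = 90) -> dict:
    """Stand-in for the GA4 Data API / BigQuery fetch: {date: sessions}."""
    today = date.today()
    return {today - timedelta(days=i): random.gauss(100_000, 20_000)
            for i in range(2, days + 2)}

def run_daily_check(threshold_pct: float = 20) -> None:
    history = fetch_history()
    target = date.today() - timedelta(days=2)    # most recent complete day
    peers = [v for d, v in history.items()       # day-of-week baseline
             if d.weekday() == target.weekday() and d < target]
    baseline = sum(peers) / len(peers)
    current = history[target]
    deviation = (current - baseline) / baseline * 100
    if abs(deviation) >= threshold_pct:
        print(f"ALERT: sessions {deviation:+.1f}% vs. baseline {baseline:,.0f}")
    else:
        print(f"OK: sessions within ±{threshold_pct}% ({deviation:+.1f}%)")

run_daily_check()
```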
The system is self-sufficient. Once configured and deployed, it requires minimal maintenance - mainly periodic threshold adjustments based on how well it's performing.
Real Impact: Before and After
The practical difference has been significant.
Before the system:
Manually checked GA4 3-4 times daily
Often discovered issues hours or days after they occurred
Spent 30-45 minutes daily on monitoring (often finding nothing unusual)
Occasionally missed significant changes that were buried in normal noise
After implementation:
System monitors continuously, I check only when alerted
Notified of any meaningful change as soon as the data is complete (limited only by the 2-day data lag)
Spend about 5 minutes per week reviewing legitimate alerts
Haven't missed a significant anomaly (>20% change) since deployment
The time savings alone would justify the effort, but the real value is confidence. I know the system is watching, and I'll be alerted if anything unusual happens. That peace of mind is worth more than the hours saved.
Key Lessons Learned
1. Reality Beats Theory
I started with sophisticated statistical methods because that's what the literature recommended. Real-world data showed me that simpler percentage-based thresholds work better for volatile digital analytics data. Theory matters, but results matter more.
2. Data Quality Outweighs Algorithmic Sophistication
The 2-day lag to ensure complete data was more impactful than any algorithmic improvement I could make. Having accurate data beats having clever algorithms operating on incomplete data.
3. Make It Understandable
A system that alerts on "2.5 standard deviations from mean" requires explanation every time. A system that alerts on ">25% revenue change" is immediately actionable. Design for the humans who will use the system, not for algorithmic elegance.
4. Alert Fatigue Is Real
Ten alerts per day that are mostly noise is worse than two alerts per week that consistently matter. Better to err on the side of fewer, higher-quality alerts than more alerts that train people to ignore them.
5. Context Makes Alerts Actionable
"Revenue dropped 30%" sends someone to investigate with no clear direction. "Revenue dropped 30%, primarily from mobile organic search traffic" tells them where to start looking. The dimensional breakdown transforms an alert from a problem statement into a diagnostic starting point.
What's Next: Future Enhancements
The current system works well for its intended purpose, but there are several directions for improvement:
Predictive anomalies: Instead of detecting anomalies after they occur, use trend analysis to predict if current patterns will lead to an anomaly in the near future. This shifts from reactive to proactive.
Deeper root cause analysis: Automatically investigate detected anomalies more thoroughly, checking correlations with external factors like weather, news events, or competitor actions.
Automated remediation: For specific, well-defined anomaly types (like a broken tracking implementation or a paused campaign that should be running), automatically take corrective action within defined guardrails.
Natural language interface: Query the system conversationally: "What caused the revenue spike last Tuesday?" and receive an AI-generated analysis.
Continuous learning: Track which alerts led to action and which were dismissed. Use this feedback to automatically tune thresholds over time.
Practical Guidance for Implementation
If you're considering building something similar, here's a realistic roadmap based on what worked for me:
Week 1: Foundation
Set up GA4 Data API access or BigQuery export
Create and test service account credentials
Verify you can pull data for your key metrics
Choose 3-5 metrics to monitor initially (you can always add more later)
Week 2: Detection Logic
Implement percentage-based deviation calculation
Set initial thresholds (start with ±20-25% as a baseline)
Add severity classification logic
Test with historical data to validate your thresholds are reasonable
Week 3: Alerting
Configure Slack webhook or email SMTP settings
Create alert message templates with proper formatting
Test alert delivery end-to-end
Set up routing by severity level
Week 4: Automation
Schedule automated daily runs
Add comprehensive logging
Deploy to an always-on environment (cloud VM or server)
Monitor for a week and tune thresholds based on alert frequency
Ongoing: Refinement
Track which alerts were useful vs. noise
Adjust thresholds quarterly based on performance
Gradually add additional metrics
Enhance with dimensional analysis and context
Technical Considerations
A few practical technical notes without diving into implementation details:
Language: Python is well-suited for this. Strong data science libraries, mature GA4 SDKs, and excellent community support.
Infrastructure:
Development: Run on your local machine for testing
Production: Cloud VM (Google Cloud, AWS, DigitalOcean) for 24/7 operation
Enterprise: Containerized deployment with Kubernetes for scale
Data storage:
Minimal: Log files only for debugging
Better: PostgreSQL for storing anomaly history
Best: Data warehouse integration for comprehensive historical analysis
Scheduling:
Simple: Python's schedule library or cron jobs
Production: Apache Airflow or Prefect for robust workflow management
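As a sketch of the simple route, a daily run with the schedule library looks like this; the run time is arbitrary:

```python
import time
import schedule  # pip install schedule

def run_daily_check():
    # Placeholder for the fetch -> detect -> alert pipeline described above.
    print("Running anomaly check...")

schedule.every().day.at("08:00").do(run_daily_check)  # arbitrary run time

while True:
    schedule.run_pending()
    time.sleep(60)
```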
Cost estimate for production deployment:
GA4 Data API: Free within rate limits
BigQuery: ~$0-50/month for typical usage
Cloud hosting: ~$20-100/month depending on requirements
Total: Under $200/month for a full production system
Bringing It All Together
This project started as a way to validate the agentic AI concepts I wrote about previously. What I found is that the practical challenges and decisions involved in building a real system teach you things that no amount of theoretical discussion can convey.
The key insights:
Percentage-based thresholds proved more effective than statistical methods for volatile digital analytics data.
Day-of-week comparisons dramatically reduce false positives by respecting natural business rhythms.
90-day historical baselines provide stable reference points without becoming stale.
Two-day data lag ensures accuracy by waiting for complete data rather than reacting to incomplete information.
Rich, contextual alerts with dimensional breakdowns make notifications actionable rather than just informative.
Simplicity matters. The most sophisticated approach isn't always the most effective. Sometimes straightforward methods that people can understand and trust work better than complex algorithms that remain opaque.
If you're spending significant time manually monitoring dashboards and wondering if you're missing important changes in your data, consider building or implementing an automated anomaly detection system. The initial investment of time pays back quickly, and the ongoing benefit of having a reliable system watching your data continuously is substantial.
The future of analytics isn't more dashboards to manually check. It's intelligent systems that continuously monitor your data and alert you when human attention and expertise are actually needed. This project showed me that future is already achievable with current technology - you just need to build it.
