
Building an Agentic AI System for Anomaly Detection: From Theory to Practice

  • Writer: Rahul Ramanujam
  • Oct 20
  • 12 min read

A few weeks ago, I wrote about agentic AI in analytics and how autonomous systems are transforming how we work with data. The post resonated with many in the analytics community, but talking about concepts is one thing—actually building something is another.

So I decided to put theory into practice. I built an autonomous anomaly detection system for Google Analytics 4 that runs 24/7, monitors real GA4 data, and alerts me when something genuinely unusual happens. Not as a proof of concept, but as a production system that handles actual business data.


This post is about that journey: the decisions I made, the problems I encountered, and what I learned building a real agentic AI system.

The Challenge: Manual Monitoring Doesn't Scale

Like most people responsible for digital analytics, I found myself checking dashboards multiple times a day. The routine was familiar: open GA4, scan the key metrics, look for anything unusual, close the tab. Repeat a few hours later.


The problem is obvious: this doesn't scale. You can't watch every metric with the attention it deserves. By the time you spot something in your weekly review, the moment to act has often passed. And honestly, most checks reveal nothing unusual - it's just normal day-to-day variation.


What I needed was a system that could monitor continuously and alert me only when something actually warranted attention. The goal wasn't to eliminate human judgment, but to make it more efficient. Let the system handle the routine surveillance, and bring humans in when their expertise is actually needed.

Building the Foundation: Choosing a Data Source

The first major decision was how to access GA4 data. There are two main approaches, each with different trade-offs.


Option 1: GA4 Data API

The GA4 Data API is straightforward to set up. You enable it in Google Cloud Console, create a service account, grant it access to your GA4 property, and you're pulling data within 30 minutes.

The advantages are clear:

  • Quick setup with minimal infrastructure

  • Suitable for most monitoring use cases

  • Near real-time data access (data typically lands within about 30 minutes of collection)

The limitations become apparent at scale:

  • Rate limits of 10 requests per second

  • Maximum 100,000 rows per request

  • Limited query flexibility compared to SQL

  • Costs can add up with high-volume monitoring


Option 2: BigQuery Export

GA4's BigQuery export sends your raw event data to Google's data warehouse continuously. This is the more robust option, especially for production systems.

The benefits:

  • No API rate limits to worry about

  • Full SQL query capabilities

  • Can join with other data sources

  • More cost-effective at scale

  • Supports real-time streaming for GA4 360 customers

The trade-offs:

  • Requires enabling BigQuery export (24-hour delay for initial setup)

  • Slightly steeper learning curve if you're not familiar with SQL

  • Need to understand BigQuery's data structure


For this project, I started with the GA4 Data API because I wanted to validate the concept quickly. It worked well for my needs, but I'd recommend BigQuery for production deployments at scale or if you need custom metrics and complex analysis.
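
To make this concrete, here's a minimal sketch of the kind of daily pull I started with, using Google's official Python client for the Data API (google-analytics-data). The property ID is a placeholder and the metric list mirrors the four metrics discussed in the next section; treat it as a starting point under those assumptions, not the exact production code.

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

PROPERTY_ID = "123456789"  # placeholder GA4 property ID

def fetch_daily_metrics():
    """Pull 90 days of daily sessions, users, conversions and revenue."""
    client = BetaAnalyticsDataClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    request = RunReportRequest(
        property=f"properties/{PROPERTY_ID}",
        dimensions=[Dimension(name="date")],
        metrics=[
            Metric(name="sessions"),
            Metric(name="activeUsers"),
            Metric(name="conversions"),
            Metric(name="totalRevenue"),
        ],
        date_ranges=[DateRange(start_date="90daysAgo", end_date="2daysAgo")],
    )
    response = client.run_report(request)
    rows = []
    for row in response.rows:
        rows.append({
            "date": row.dimension_values[0].value,  # YYYYMMDD string
            "sessions": float(row.metric_values[0].value),
            "active_users": float(row.metric_values[1].value),
            "conversions": float(row.metric_values[2].value),
            "revenue": float(row.metric_values[3].value),
        })
    return rows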

Defining Success: Which Metrics Matter

Not all metrics deserve equal attention. I focused on four core metrics that tell the story of business health:


  • Sessions - Overall traffic trends

  • Active Users - Unique visitor engagement

  • Conversions - Critical actions taken

  • Revenue - Direct business impact


Each metric was assigned a business impact level (high or critical), which later influenced how alerts were prioritized and routed. The key insight here: monitor outcomes you care about, not vanity metrics that look good in presentations but don't drive decisions.

The Technical Challenge: What Is an Anomaly?

This is where things got interesting. Defining "anomaly" turns out to be harder than it sounds.


If sessions are 15% above average, is that unusual? What about 25%? 40%? The answer depends entirely on your data's natural variability. A 20% swing might be completely normal for one metric but alarming for another.


First Attempt: Statistical Detection with Z-Scores

I started with the classic statistical approach that you'll find in most data science textbooks: Z-score analysis.


The concept is elegant. Calculate the historical mean and standard deviation for each metric. Then, for each new data point, measure how many standard deviations away from the mean it falls. If it's beyond a certain threshold - say, 2.5 standard deviations - flag it as an anomaly.


I configured three sensitivity levels:

  • High (2.0σ): roughly 95% of normal variation falls within this band, so anything beyond it is flagged

  • Medium (2.5σ): about 98.8% of normal variation falls within the band

  • Low (3.0σ): around 99.7% of normal variation falls within the band
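
For reference, the check itself is only a few lines. This is a minimal sketch of the z-score logic described above, with illustrative numbers rather than real GA4 values:

import statistics

def zscore_is_anomaly(current_value, history, sensitivity=2.5):
    """Flag the value if it sits more than `sensitivity` standard
    deviations away from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = (current_value - mean) / stdev if stdev else 0.0
    return abs(z) > sensitivity, z

# Medium sensitivity (2.5 sigma) against a small illustrative history
flagged, z = zscore_is_anomaly(152_000, [120_000, 131_000, 118_000, 140_000, 125_000])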


This should have worked beautifully. And for well-behaved, normally-distributed data with consistent variance, it does. But my data had other ideas.

The Reality: Digital Analytics Data Is Extremely Volatile

When I ran the system against actual GA4 data, I discovered something that fundamentally changed my approach: digital analytics data is far more volatile than traditional statistical methods can handle effectively.


Here's what I found in my data:


Sessions:

  • Standard deviation: 35,157 (25.7% of the mean)

  • A 21.5% increase from average generated a Z-score of only 0.84

  • With high sensitivity settings (2.0σ), I would need a 51% change to trigger an alert

Revenue:

  • Standard deviation: $190,946 (34.2% of the mean)

  • A 29.9% increase generated a Z-score of 0.88

  • Would need a 68% change to trigger an alert with high sensitivity


Think about that for a moment. A 30% revenue change - which is absolutely something you'd want to know about immediately - was being classified as "normal variation" by the statistical model.


Understanding the Volatility

Several factors contribute to this high day-to-day variance:


  • Weekend vs. weekday effects create dramatic traffic pattern differences

  • Campaign activity generates spikes when campaigns launch and drops when they pause

  • Seasonal patterns introduce week-to-week variation even within the same month

  • Natural randomness is inherent in how people interact with websites


When you calculate standard deviation across all days together, mixing weekends with weekdays and campaign periods with baseline periods, you end up with a large value that makes it nearly impossible to detect meaningful anomalies using purely statistical methods.

The Pivot: Percentage-Based Detection

After seeing Z-scores below 1.0 for changes I definitely wanted to know about, I realized the approach needed to change. The solution turned out to be simpler than the problem: percentage-based thresholds.


Instead of asking "Is this X standard deviations from the mean?" I started asking "Is this more than X percent different from the historical average?"


The math is straightforward:

Deviation % = ((Current Value - Historical Mean) / Historical Mean) × 100

If the absolute deviation exceeds a defined threshold, flag it as an anomaly.
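
In code, the whole detection step reduces to a few lines. A minimal sketch (the function names are mine; the thresholds come from the next subsection):

def percent_deviation(current_value, historical_mean):
    """Percentage difference from the historical mean."""
    return (current_value - historical_mean) / historical_mean * 100

def is_anomaly(current_value, historical_mean, threshold_pct):
    """Flag the value if the absolute deviation exceeds the threshold."""
    return abs(percent_deviation(current_value, historical_mean)) > threshold_pct

# A 21.5% lift against a ±20% threshold is flagged; an 18.6% lift is not
is_anomaly(121.5, 100.0, threshold_pct=20)   # True
is_anomaly(118.6, 100.0, threshold_pct=20)   # False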


Setting the Thresholds

I configured thresholds based on business context and observed data patterns:

  • Sessions & Active Users: ±20% threshold

  • Conversions & Revenue: ±25% threshold


These weren't arbitrary numbers. They came from asking: "What level of change would I actually want to be notified about?" Combined with observing typical variation patterns, these thresholds hit the sweet spot between catching real issues and avoiding alert fatigue.


Why This Works Better

  1. Predictability. Telling someone "I'll alert you if revenue changes by more than 25%" requires no statistical background to understand. It's immediately clear what will and won't trigger an alert.

  2. Business alignment. Stakeholders think in percentages, not standard deviations. A "25% revenue drop" is intuitively meaningful in a way "2.8 standard deviations below mean" simply isn't.

  3. Tunability. If you're getting too many alerts, increase the threshold to 30%. Too few? Lower it to 20%. The relationship between the setting and the outcome is direct and obvious.

  4. Robustness to volatility. High standard deviation doesn't matter anymore. A 22% change is a 22% change, regardless of how volatile your historical data has been.


The Results

With percentage-based detection in place, my October 6 data finally triggered appropriate alerts:

Metric          Actual Change   Threshold   Detection
Sessions        +21.5%          ±20%        Anomaly
Active Users    +18.6%          ±20%        Normal
Conversions     +26.8%          ±25%        Anomaly
Revenue         +29.9%          ±25%        Anomaly

Three legitimate anomalies detected, one normal fluctuation correctly ignored. This is exactly what you want from an anomaly detection system.

Improving Accuracy: The Day-of-Week Solution

Even with percentage-based detection working well, there's a refinement that significantly improves accuracy: comparing like days to like days.


Monday traffic patterns are fundamentally different from Saturday patterns. Comparing Monday to the overall average (which includes low-traffic weekends) inflates variance and can trigger false positives.


The solution: compare Monday to previous Mondays, Tuesday to previous Tuesdays, and so on.


Implementation approach:

  • Fetch 90 days of historical data (about 13 instances of each day of the week)

  • When analyzing a Monday, calculate mean and threshold using only previous Mondays

  • Apply the same logic for each day of the week


This respects the natural weekly rhythm of your business. If your weekend traffic is consistently 40% lower than weekdays, the system won't incorrectly flag that as anomalous.
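
A minimal sketch of that baseline calculation, assuming daily rows like the ones returned by the Data API pull shown earlier (dates come back as YYYYMMDD strings); the helper name is mine:

from datetime import datetime

def same_weekday_baseline(rows, target_date, metric):
    """Mean of `metric` across prior days sharing the target date's weekday."""
    target_weekday = datetime.strptime(target_date, "%Y%m%d").weekday()
    history = [
        row[metric]
        for row in rows
        if row["date"] < target_date
        and datetime.strptime(row["date"], "%Y%m%d").weekday() == target_weekday
    ]
    return sum(history) / len(history) if history else None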

The 90-Day Baseline Decision

How much historical data should inform your baseline? This turned out to be more important than I initially thought.

  • Too short (7-14 days): Baseline becomes too sensitive to recent fluctuations. A single unusual week can skew your entire baseline.

  • Too long (180+ days): You might miss gradual trends or seasonal shifts. Your baseline becomes stale.

  • The middle ground (60-90 days): Long enough to establish stable patterns, recent enough to reflect current business reality.


I settled on 90 days. It captures about three months of patterns, smoothing out weekly volatility while staying relevant to current conditions.

Handling Data Processing Delays

GA4 data isn't final immediately. It can take 24-48 hours for all events to be fully processed and reconciled. You might see 10,000 sessions reported for yesterday, then check again tomorrow and find that same date showing 15,000 sessions once late-arriving events have been incorporated.


This creates a problem: if you analyze incomplete data, you'll get false anomalies from data that simply hasn't finished processing yet.


My solution: Analyze data from 2 days ago.


By the time you're looking at data from two days ago, GA4 has had 48 hours to process everything. The numbers are stable and won't change. Your comparisons are accurate.


The trade-off is clear: you're not getting real-time anomaly detection. But you are getting reliable anomaly detection for complete data. For most businesses, being notified about yesterday's anomaly today is perfectly adequate - and far better than being notified about a false anomaly that isn't actually real.
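
In code, this is just date arithmetic; a small sketch, formatted to match the YYYYMMDD date strings the Data API returns:

from datetime import date, timedelta

# Analyse the most recent day GA4 has had roughly 48 hours to finish processing
analysis_date = date.today() - timedelta(days=2)
analysis_date_str = analysis_date.strftime("%Y%m%d")  # matches the GA4 date dimension format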

Building Alert Intelligence: Severity Classification

Not every anomaly deserves the same response. A 21% increase in sessions is notable but probably doesn't need to wake up the CEO. A 45% revenue drop absolutely does.

I implemented a severity matrix that considers both the size of the deviation and the business impact of the metric:


For critical business impact metrics (Revenue, Conversions):

  • Deviation >40% = Critical severity

  • Deviation 25-40% = High severity

  • Deviation 15-25% = Medium severity

For high business impact metrics (Sessions, Users):

  • Deviation >50% = High severity

  • Deviation 30-50% = Medium severity

  • Deviation 20-30% = Low severity

This maps to alert routing:

  • Critical severity: Leadership Slack channel + email to executives

  • High/Medium severity: Analytics team Slack channel + email to team

  • Low severity: Logged but might not generate active alerts
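
A minimal sketch of how this matrix can be encoded; the metric groupings, function name, and routing targets are placeholders for illustration:

CRITICAL_IMPACT = {"revenue", "conversions"}
HIGH_IMPACT = {"sessions", "active_users"}

def classify_severity(metric, deviation_pct):
    """Map absolute deviation and business impact to a severity level."""
    d = abs(deviation_pct)
    if metric in CRITICAL_IMPACT:
        if d > 40:
            return "critical"
        if d > 25:
            return "high"
        if d > 15:
            return "medium"
    elif metric in HIGH_IMPACT:
        if d > 50:
            return "high"
        if d > 30:
            return "medium"
        if d > 20:
            return "low"
    return "none"

# Severity decides where the alert goes
ROUTING = {
    "critical": ["#leadership", "exec-email"],
    "high": ["#analytics-team", "team-email"],
    "medium": ["#analytics-team", "team-email"],
    "low": ["log-only"],
}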

Creating Actionable Alerts

The difference between a useful alert and noise is context. When the system detects an anomaly, it needs to provide enough information that someone can understand what happened and decide what to do about it.


My alerts include:

Core metrics:

  • Current value and expected value

  • Percentage deviation and the threshold that triggered it

  • Date being analyzed

  • Severity level

Context:

  • Dimensional breakdown showing which channels, devices, or sources contributed most

  • Brief analysis suggesting where to look first

  • Historical trend showing the last several days for pattern recognition

Example email alert:

[Screenshot of the alert email]

This gives whoever sees the alert enough information to either dismiss it as expected (maybe we just launched a sale) or investigate further with a clear starting point.
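
To show how an alert carrying these fields might be delivered, here's a minimal sketch that formats the core details and posts them to a Slack incoming webhook. The webhook URL, function name, and message layout are placeholders; email via SMTP would follow the same pattern.

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_slack_alert(metric, current, expected, deviation_pct, threshold_pct, severity, breakdown):
    """Post a formatted anomaly alert to a Slack incoming webhook."""
    message = (
        f":rotating_light: {severity.upper()} anomaly in {metric}\n"
        f"Current: {current:,.0f} | Expected: {expected:,.0f} "
        f"| Deviation: {deviation_pct:+.1f}% (threshold ±{threshold_pct}%)\n"
        f"Top contributors: {breakdown}"
    )
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()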

Dealing with False Positives

Even with good thresholds and day-of-week comparisons, false positives happen. The system includes several mechanisms to minimize them:


  • Minimum severity filtering: Only send alerts for medium severity and above during the trial period. This lets you validate the system without being overwhelmed.

  • Configurable thresholds per metric: Some metrics are naturally more volatile than others. Don't force them all to use the same threshold.

  • Historical context in alerts: Seeing the last 7 days of data helps you quickly recognize if something is genuinely unusual or just normal variation that happened to cross a threshold.

  • Iterative tuning: Track alert frequency and usefulness. If you're consistently dismissing alerts for a particular metric, increase its threshold.


My target is 2-4 meaningful alerts per week. More than that and I'm probably being too sensitive; fewer than that and I might be missing important signals.

The System in Practice

The complete system runs as a scheduled task that executes daily:

  1. Fetch data: Pull the last 90 days of metrics from GA4

  2. Calculate baselines: Compute historical mean for each metric (optionally by day of week)

  3. Detect anomalies: Compare the most recent complete day (2 days ago) against the baseline

  4. Calculate deviation: Determine percentage difference from expected

  5. Classify severity: Apply the severity matrix based on deviation and business impact

  6. Gather context: For detected anomalies, fetch dimensional breakdowns

  7. Generate alerts: Format and send email and/or Slack messages

  8. Log everything: Keep records for future analysis and threshold tuning


The system is self-sufficient. Once configured and deployed, it requires minimal maintenance - mainly periodic threshold adjustments based on how well it's performing.
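
For orientation, here is a skeleton of how those eight steps might be wired together. Every helper it calls is either one of the sketches shown earlier in this post or a hypothetical placeholder (THRESHOLDS, fetch_dimensional_breakdown); it's an illustration of the flow, not the exact production code.

import logging
from datetime import date, timedelta

THRESHOLDS = {"sessions": 20, "active_users": 20, "conversions": 25, "revenue": 25}

def run_daily_check():
    target = (date.today() - timedelta(days=2)).strftime("%Y%m%d")   # complete data only
    rows = fetch_daily_metrics()                                     # step 1: last 90 days
    for metric, threshold_pct in THRESHOLDS.items():
        baseline = same_weekday_baseline(rows, target, metric)       # step 2: like-for-like baseline
        current = next(r[metric] for r in rows if r["date"] == target)
        deviation = percent_deviation(current, baseline)             # steps 3-4
        if abs(deviation) > threshold_pct:                           # anomaly detected
            severity = classify_severity(metric, deviation)          # step 5
            breakdown = fetch_dimensional_breakdown(metric, target)  # step 6 (hypothetical helper)
            send_slack_alert(metric, current, baseline, deviation,
                             threshold_pct, severity, breakdown)     # step 7
        logging.info("%s: %+.1f%% vs ±%d%%", metric, deviation, threshold_pct)  # step 8

# Trigger run_daily_check() via cron, Python's schedule library, or an orchestrator like Airflow.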

Real Impact: Before and After

The practical difference has been significant.

Before the system:

  • Manually checked GA4 3-4 times daily

  • Often discovered issues hours or days after they occurred

  • Spent 30-45 minutes daily on monitoring (often finding nothing unusual)

  • Occasionally missed significant changes that were buried in normal noise

After implementation:

  • System monitors continuously, I check only when alerted

  • Notified of any meaningful change as soon as complete data is available (limited only by the 2-day processing lag)

  • Spend about 5 minutes per week reviewing legitimate alerts

  • Haven't missed a significant anomaly (>20% change) since deployment


The time savings alone would justify the effort, but the real value is confidence. I know the system is watching, and I'll be alerted if anything unusual happens. That peace of mind is worth more than the hours saved.

Key Lessons Learned

1. Reality Beats Theory

I started with sophisticated statistical methods because that's what the literature recommended. Real-world data showed me that simpler percentage-based thresholds work better for volatile digital analytics data. Theory matters, but results matter more.


2. Data Quality Outweighs Algorithmic Sophistication

The 2-day lag to ensure complete data was more impactful than any algorithmic improvement I could make. Having accurate data beats having clever algorithms operating on incomplete data.


3. Make It Understandable

A system that alerts on "2.5 standard deviations from mean" requires explanation every time. A system that alerts on ">25% revenue change" is immediately actionable. Design for the humans who will use the system, not for algorithmic elegance.


4. Alert Fatigue Is Real

Ten alerts per day that are mostly noise is worse than two alerts per week that consistently matter. Better to err on the side of fewer, higher-quality alerts than more alerts that train people to ignore them.


5. Context Makes Alerts Actionable

"Revenue dropped 30%" sends someone to investigate with no clear direction. "Revenue dropped 30%, primarily from mobile organic search traffic" tells them where to start looking. The dimensional breakdown transforms an alert from a problem statement into a diagnostic starting point.

What's Next: Future Enhancements

The current system works well for its intended purpose, but there are several directions for improvement:

  1. Predictive anomalies: Instead of detecting anomalies after they occur, use trend analysis to predict if current patterns will lead to an anomaly in the near future. This shifts from reactive to proactive.

  2. Deeper root cause analysis: Automatically investigate detected anomalies more thoroughly, checking correlations with external factors like weather, news events, or competitor actions.

  3. Automated remediation: For specific, well-defined anomaly types (like a broken tracking implementation or a paused campaign that should be running), automatically take corrective action within defined guardrails.

  4. Natural language interface: Query the system conversationally: "What caused the revenue spike last Tuesday?" and receive an AI-generated analysis.

  5. Continuous learning: Track which alerts led to action and which were dismissed. Use this feedback to automatically tune thresholds over time.

Practical Guidance for Implementation

If you're considering building something similar, here's a realistic roadmap based on what worked for me:


Week 1: Foundation

  • Set up GA4 Data API access or BigQuery export

  • Create and test service account credentials

  • Verify you can pull data for your key metrics

  • Choose 3-5 metrics to monitor initially (you can always add more later)

Week 2: Detection Logic

  • Implement percentage-based deviation calculation

  • Set initial thresholds (start with ±20-25% as a baseline)

  • Add severity classification logic

  • Test with historical data to validate your thresholds are reasonable

Week 3: Alerting

  • Configure Slack webhook or email SMTP settings

  • Create alert message templates with proper formatting

  • Test alert delivery end-to-end

  • Set up routing by severity level

Week 4: Automation

  • Schedule automated daily runs

  • Add comprehensive logging

  • Deploy to an always-on environment (cloud VM or server)

  • Monitor for a week and tune thresholds based on alert frequency

Ongoing: Refinement

  • Track which alerts were useful vs. noise

  • Adjust thresholds quarterly based on performance

  • Gradually add additional metrics

  • Enhance with dimensional analysis and context

Technical Considerations

A few practical technical notes without diving into implementation details:


Language: Python is well-suited for this. Strong data science libraries, mature GA4 SDKs, and excellent community support.

Infrastructure:

  • Development: Run on your local machine for testing

  • Production: Cloud VM (Google Cloud, AWS, DigitalOcean) for 24/7 operation

  • Enterprise: Containerized deployment with Kubernetes for scale

Data storage:

  • Minimal: Log files only for debugging

  • Better: PostgreSQL for storing anomaly history

  • Best: Data warehouse integration for comprehensive historical analysis

Scheduling:

  • Simple: Python's schedule library or cron jobs

  • Production: Apache Airflow or Prefect for robust workflow management

Cost estimate for production deployment:

  • GA4 Data API: Free within rate limits

  • BigQuery: ~$0-50/month for typical usage

  • Cloud hosting: ~$20-100/month depending on requirements

  • Total: Under $200/month for a full production system

Bringing It All Together

This project started as a way to validate the agentic AI concepts I wrote about previously. What I found is that the practical challenges and decisions involved in building a real system teach you things that no amount of theoretical discussion can convey.


The key insights:


  1. Percentage-based thresholds proved more effective than statistical methods for volatile digital analytics data.

  2. Day-of-week comparisons dramatically reduce false positives by respecting natural business rhythms.

  3. 90-day historical baselines provide stable reference points without becoming stale.

  4. Two-day data lag ensures accuracy by waiting for complete data rather than reacting to incomplete information.

  5. Rich, contextual alerts with dimensional breakdowns make notifications actionable rather than just informative.

  6. Simplicity matters. The most sophisticated approach isn't always the most effective. Sometimes straightforward methods that people can understand and trust work better than complex algorithms that remain opaque.


If you're spending significant time manually monitoring dashboards and wondering if you're missing important changes in your data, consider building or implementing an automated anomaly detection system. The initial investment of time pays back quickly, and the ongoing benefit of having a reliable system watching your data continuously is substantial.


The future of analytics isn't more dashboards to manually check. It's intelligent systems that continuously monitor your data and alert you when human attention and expertise are actually needed. This project showed me that future is already achievable with current technology - you just need to build it.
