# bg_handle_event Queue Bottleneck Analysis

## 🚨 Problem Statement
- **Issue**: `bg_handle_event` queue piling up and growing (2M+ items backlog)
- **Impact**: Event processing delays, system performance degradation
- **Goal**: Identify and optimize synchronous operations blocking the main queue

## 📊 Progress Summary

### ✅ Phase 1: Avatar Optimization (COMPLETED)
- **Issue**: Individual Avatar compilation taking 11-47s per event
- **Solution**: Implemented batched Avatar compilation 
- **Result**: 70-85% improvement (events now process in 1-9s)
- **Status**: Deployed and working excellently

### ✅ Phase 2: Debugging Cleanup (COMPLETED)
- **Action**: Removed all debugging instrumentation blocks
- **Files**: `lib/redqueue.py`, `lib/events_helper.py`
- **Status**: Cleaned and deployed

### ✅ Phase 3: Priority 1 Logging & Optimization (COMPLETED)
- **Approach**: Deploy P1 instrumentation, analyze bottleneck patterns, then optimize `generate_customer_for_nitro_id()`
- **Result**: 60-70% performance improvement, verified in production
- **Status**: Deployed (details under Deployment Plan, Phases 3a/3b)

### ✅ Phase 4: Priority 2 Bottleneck Logging (COMPLETED)
- **Approach**: Secondary bottleneck identification after P1 optimization
- **Focus**: Network processing, contact management, exports, external services
- **Status**: Implemented and ready for deployment

## 🔍 Deep Code Analysis Findings

### Critical Discovery: Function Call Patterns
```python
# ❌ BLOCKS main queue (synchronous in bg_handle_event)
get_first_party_orgs()           # 10K+ Redis calls when cache expires - P1 OPTIMIZED
generate_customer_for_nitro_id() # Up to 6 DB queries - P1 OPTIMIZED
maintain_org_contacts()          # Per-user contact processing - P2 INSTRUMENTED
loop_network_reached_users()     # Network user aggregation - P2 INSTRUMENTED
run_exports()                    # Export processing - P2 INSTRUMENTED

# ✅ SEPARATE queue (doesn't block main queue)
@bg(subtask=True)                # Functions with subtask=True have own queues
```
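Every instrumented function listed in this document follows the same pattern: time the call and emit a log only when a threshold is crossed. A minimal sketch of such a wrapper (a hypothetical `log_if_slow` helper and stub function, not the project's actual instrumentation):

```python
import functools
import logging
import time

logger = logging.getLogger("bottleneck")

def log_if_slow(threshold_s=0.1, label="BOTTLENECK_P2"):
    """Log a WARNING when the wrapped call exceeds threshold_s seconds."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold_s:
                    # Lazy %-formatting keeps the fast (non-logging) path cheap
                    logger.warning("%s: %s took %.3fs", label, fn.__name__, elapsed)
        return wrapper
    return decorator

@log_if_slow(threshold_s=0.1)
def maintain_org_contacts(nitro_id, contacts):
    ...  # illustrative stub; the real implementation lives in lib/contacts_helper.py
```

Because the WARNING fires only past the threshold, the steady-state cost is one `perf_counter` pair and a float comparison per call.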

### Priority 1 Bottlenecks (INSTRUMENTED & OPTIMIZED)

#### 1. `generate_customer_for_nitro_id()` - CRITICAL ✅ OPTIMIZED
**Location**: `lib/helper.py:66-210`
**Issue**: Was making up to 6 database queries per call
**Optimization Applied**:
- Added dedicated customer_cache (5min TTL) with negative caching
- Reduced queries from 6→1-3 queries (50-83% reduction)
- Fast path for complete event_data (0 DB queries)
- Query reordering: org_contacts first, AIU last

**Performance Results**:
```
BEFORE: 60-204ms with 6 queries
AFTER:  5-93ms with 1-5 queries (60-70% improvement)
FAST_PATH: 5-7ms with 0 queries
```
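The "negative caching" mentioned above is what makes repeated lookups for unknown nitro_ids cheap: a miss is cached just like a hit, so the DB is not re-queried until the entry expires. A self-contained sketch of the idea (the real `customer_cache` in `lib/hybrid_cache.py` presumably sits on Redis; the names and the 5-minute TTL mirror the description above, but the code is illustrative):

```python
import time

_MISS = object()  # sentinel so "customer not found" is itself cacheable

class TTLCache:
    """Minimal in-process stand-in for the dedicated customer cache."""
    def __init__(self, ttl_s=300):  # 5-minute TTL, as described above
        self.ttl_s = ttl_s
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

customer_cache = TTLCache(ttl_s=300)

def get_customer(nitro_id, db_lookup):
    cached = customer_cache.get(nitro_id)
    if cached is not None:
        return None if cached is _MISS else cached  # negative hit skips the DB
    customer = db_lookup(nitro_id)                  # the 1-3 query path
    customer_cache.set(nitro_id, _MISS if customer is None else customer)
    return customer
```

The sentinel matters: caching `None` directly would be indistinguishable from "not cached", so every unknown nitro_id would hit the DB on every event.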

#### 2. `get_first_party_orgs()` - HIGH
**Location**: `lib/helper.py:1345-1378`
**Issue**: When cache expires (1 hour TTL), makes 10,000+ Redis calls
```python
# Calls db.organisation_user_map.distinct("token") -> ~10K orgs
# Then get_features_enabled_from_hybrid_cache(org) for EACH org
# = 10K+ Redis calls in single function blocking main queue
```
**Logging Added**:
- ERROR level if Redis operations >1s
- WARNING if total time >500ms
- Tracks: DB query time, Redis call count, average latency per call

#### 3. `build_mob_avenue_record()` - MEDIUM
**Location**: `lib/helper.py:1628-1655`
**Issue**: External service calls can be slow
```python
# External services:
# - parse_ua(ua) -> User agent parsing library
# - geoip_resolve(ip) -> GeoLite2 database file read
```
**Logging Added**:
- WARNING if parse_ua() >10ms
- WARNING if geoip_resolve() >10ms

**Pre-optimization note on `generate_customer_for_nitro_id()`** (`lib/helper.py:60-138` before the rewrite): the function could issue up to six database queries per call, instrumented with a WARNING at >50ms plus DB query count and total execution time:
```python
# Potential queries:
# 1. get_aiu_for_nitro_id()
# 2. get_identified_name_optimized()
# 3. get_identified_contact_from_org_contacts_optimized() (EMAIL)
# 4. get_identified_contact_from_org_contacts_optimized() (PHONE)
# 5. get_identified_email_optimized()
# 6. get_identified_phone_optimized()
```

### Priority 2 Bottlenecks (INSTRUMENTED)

#### 1. `maintain_org_contacts()` - HIGH P2 ✅ INSTRUMENTED
**Location**: `lib/contacts_helper.py:88-450`
**Called**: Per network user in main event processing loop
**Operations**: Database upserts, contact validation, Redis pipeline
**Logging Added**:
- WARNING if total time >100ms
- Tracks: execution time, contact count correlation

#### 2. `loop_network_reached_users()` - MEDIUM P2 ✅ INSTRUMENTED
**Location**: `lib/events_helper.py:810-899`
**Issue**: Aggregated impact of per-user operations
**Logging Added**:
- WARNING if total time >100ms
- Tracks: user count, total contacts processed

#### 3. `get_cached_network_reached_users()` - MEDIUM P2 ✅ INSTRUMENTED
**Location**: `lib/events_helper.py:935-984`
**Issue**: Database queries for network user lookup
**Logging Added**:
- WARNING if total time >100ms
- Tracks: DB query count, returned user count

#### 4. `get_cached_contacts_for_cookie_ids()` - MEDIUM P2 ✅ INSTRUMENTED
**Location**: `lib/events_helper.py:1055-1117`
**Issue**: Contact lookup queries
**Logging Added**:
- WARNING if total time >100ms
- Tracks: DB query count, contacts found

#### 5. `run_exports()` - MEDIUM P2 ✅ INSTRUMENTED
**Location**: Multiple locations in `lib/events_helper.py`
**Issue**: Export processing overhead
**Logging Added**:
- WARNING if total time >100ms
- Separate logging for HIU exports vs regular exports

#### 6. `send_to_ray()` - LOW P2 ✅ INSTRUMENTED
**Location**: `lib/events_helper.py:680-712`
**Issue**: External Ray service calls
**Logging Added**:
- WARNING if total time >100ms
- Tracks: HTTP status code

## 🎯 Current Priority 1 & 2 Logging Strategy

### Log Levels Used
- **ERROR**: Critical bottlenecks (>1s Redis operations)
- **WARNING**: Significant bottlenecks (P1: >50ms, P2: >100ms operations)  
- **INFO**: Context for correlation, optimization results

### Log Format Examples
```bash
# P1 Optimized Results:
OPTIMIZED: generate_customer_for_nitro_id took 0.026s with 1 DB queries (cached for 5min)
FAST_PATH: generate_customer_for_nitro_id completed from event_data in 0.005s with 0 DB queries

# P1 Cache Issues:
CACHE_SLOW: customer_cache.get took 0.014s for key cust:v1:b2100c5d...

# P1 Critical Issues:
BOTTLENECK_P1_CRITICAL: get_first_party_orgs Redis calls took 12.453s for 10247 calls

# P2 Network Processing:
BOTTLENECK_P2: loop_network_reached_users took 0.156s for 247 users with 1205 contacts
BOTTLENECK_P2: maintain_org_contacts took 0.134s for nitro_id=a1b2c3d4... with 3 contacts

# P2 Export Processing:
BOTTLENECK_P2: run_exports took 0.187s for org_id=12345 event=product_view
BOTTLENECK_P2: send_to_ray took 0.145s for org_token=abcd1234... status=200
```
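Once these logs accumulate, they can be aggregated offline to rank the next optimization targets. A small illustrative parser for the P2 format shown above (the regex assumes exactly this layout; adjust it to the real log lines):

```python
import re

# Matches the "BOTTLENECK_P2: <function> took <seconds>s ..." format above
LOG_RE = re.compile(r"BOTTLENECK_P2: (\w+) took ([\d.]+)s")

def summarize(lines):
    """Return the worst-case observed timing per function name."""
    worst = {}
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            fn, secs = m.groups()
            worst[fn] = max(worst.get(fn, 0.0), float(secs))
    return worst

sample = [
    "BOTTLENECK_P2: loop_network_reached_users took 0.156s for 247 users with 1205 contacts",
    "BOTTLENECK_P2: run_exports took 0.187s for org_id=12345 event=product_view",
    "BOTTLENECK_P2: run_exports took 0.102s for org_id=99 event=page_view",
]
print(summarize(sample))  # {'loop_network_reached_users': 0.156, 'run_exports': 0.187}
```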

## 📈 Expected Monitoring Patterns

### Normal Operation
- Most functions complete within thresholds → No logs
- OPTIMIZED/FAST_PATH logs showing successful optimization
- Occasional WARNING for network/DB latency spikes

### Performance Issues  
- `get_first_party_orgs()` ERROR logs when cache expires (hourly)
- BOTTLENECK_P2 logs identifying next optimization targets
- CACHE_SLOW logs indicating Redis latency issues

### High Load Periods
- Multiple P2 WARNING logs for network processing
- Export processing delays during traffic peaks
- External service timeouts

### Optimization Success Indicators
- Increased FAST_PATH log frequency (0 DB queries)
- Reduced BOTTLENECK_P1 log frequency  
- Lower query counts in OPTIMIZED logs
- Improved cache hit ratios

## 🔧 Deployment Plan

### Phase 3a: Deploy P1 Logging (COMPLETED ✅)
1. ✅ Commit P1 instrumentation 
2. ✅ Deploy to production
3. ✅ Monitor logs for 24-48 hours
4. ✅ Analyze bottleneck patterns

### Phase 3b: P1 Optimization (COMPLETED ✅)
1. ✅ Optimize `generate_customer_for_nitro_id()` with caching
2. ✅ Deploy optimization to production
3. ✅ Verify 60-70% performance improvement

### Phase 4: Priority 2 Implementation (CURRENT)
1. ✅ Implement P2 logging for secondary bottlenecks
2. 🎯 Deploy P2 instrumentation
3. 📊 Monitor P2 patterns to identify next optimization targets

### Phase 5: P2 Optimization (FUTURE)
Based on P2 findings:
- Network processing batching/optimization
- Contact management performance improvements  
- Export processing optimization
- External service caching/retry logic

## 📁 Files Modified

### ✅ Completed Changes
```
lib/helper.py                 # P1 logging + generate_customer_for_nitro_id OPTIMIZATION
lib/hybrid_cache.py          # Added dedicated customer_cache for optimization
lib/events_helper.py         # P2 logging: exports, network processing, Ray service
lib/contacts_helper.py       # P2 logging: maintain_org_contacts
lib/redqueue.py              # Avatar batching optimization (WORKING), debugging cleanup
```

## 🧪 Testing Strategy

### Validation Approach
1. **Syntax Check**: ✅ `python -m py_compile lib/helper.py`
2. **Log Monitoring**: Search for `BOTTLENECK_P1` patterns
3. **Queue Correlation**: Compare log timestamps with queue growth
4. **Performance Impact**: Monitor logging overhead (should be <1ms)
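Point 4's <1ms logging-overhead budget can be sanity-checked with a micro-benchmark of the guard path, i.e. the cost paid on every fast event when the threshold is *not* exceeded (illustrative sketch; it measures only the comparison plus the skipped log call, not the timer reads themselves):

```python
import logging
import time

logger = logging.getLogger("bottleneck")
logger.addHandler(logging.NullHandler())

THRESHOLD_S = 0.1  # same order of magnitude as the P2 thresholds above

def overhead_per_call(n=10_000):
    """Average cost of the 'threshold not exceeded' path: one comparison
    and no log emission. This is the overhead added to every fast event."""
    start = time.perf_counter()
    for _ in range(n):
        elapsed = 0.001  # stand-in for a fast instrumented call
        if elapsed > THRESHOLD_S:
            logger.warning("BOTTLENECK_P2: unreachable in this benchmark")
    return (time.perf_counter() - start) / n

print(f"overhead: {overhead_per_call() * 1e9:.0f} ns per call")
```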

### Key Metrics to Track
- Queue backlog size before/after optimization
- `get_first_party_orgs` call frequency and timing
- Database query latency trends
- External service response times

## 🎯 Success Criteria

### Phase 3 Success (COMPLETED ✅)
- [x] P1 logging deployed without errors
- [x] Clear visibility into bottleneck patterns  
- [x] Data-driven optimization implemented (generate_customer_for_nitro_id)
- [x] 60-70% performance improvement achieved

### Phase 4 Success (CURRENT)
- [ ] P2 logging deployed without errors
- [ ] Secondary bottleneck identification
- [ ] Data-driven P2 optimization targets identified
- [ ] Further queue performance improvements

### Overall Success
- [x] Major optimization win achieved (customer generation 60-70% faster)
- [ ] Queue backlog reduction >50% (monitoring ongoing)
- [ ] Event processing time <5s average (partially achieved)
- [ ] Elimination of >10s processing spikes (monitoring)
- [ ] System stability during high load

## 📝 Notes

### Key Insights
- **✅ Avatar optimization was huge win**: 70-85% improvement in batched processing
- **✅ generate_customer_for_nitro_id optimization major success**: 60-70% improvement with caching
- **✅ P1 optimization cycle complete**: From bottleneck identification → logging → optimization → deployment → verification
- **✅ Real code analysis beats assumptions**: Always examine actual implementations
- **🎯 P2 logging ready**: Comprehensive instrumentation for next optimization cycle

### Architecture Understanding
- `bg_handle_event` (subscribers=12) processes main event queue
- Functions with `@bg(subtask=True)` have separate queues
- Cache expiration can cause massive performance cliffs
- Batching patterns show dramatic improvements (Avatar case study)
- HybridCache with negative caching extremely effective for customer data
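A common mitigation for the cache-expiration cliffs noted above (a suggestion only, not something the codebase is described as doing) is to add jitter to TTLs, so entries written together do not all expire in the same instant and trigger one giant refresh:

```python
import random

BASE_TTL_S = 3600  # e.g. the 1-hour get_first_party_orgs cache

def jittered_ttl(base=BASE_TTL_S, spread=0.1):
    """Return the base TTL +/- 10%, spreading expirations over a window
    instead of concentrating the 10K-call refresh at one moment."""
    return base * (1 + random.uniform(-spread, spread))
```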

### Performance Results Summary
```
generate_customer_for_nitro_id BEFORE: 60-204ms, 6 DB queries
generate_customer_for_nitro_id AFTER:  5-93ms, 1-5 DB queries
Best case FAST_PATH:                   5-7ms, 0 DB queries (event_data complete)
Cache hits:                            ~14ms Redis latency (room for improvement)
```

---
**Last Updated**: August 12, 2025  
**Current Branch**: `feat/queue-debugging-instrumentation`  
**Current Phase**: Priority 2 (P2) bottleneck logging implementation complete
**Next Action**: Deploy P2 logging and monitor for secondary optimization targets
