Inference Services
Our inference services provide the foundation for all AI-powered features in OurOtters. Built on a flexible provider system with intelligent orchestration, they prioritize cost-effectiveness and performance while maintaining enterprise-grade reliability.
Overview
Our inference architecture supports multiple AI providers through a unified interface, with intelligent routing and fallback mechanisms. While we leverage our proprietary RheaDuoBoost orchestration system for enhanced quality, our core inference strategy focuses on cost-effective open source models, particularly Gemma 3B and Llama 7B.
Primary Inference Models
Gemma 3B - Our Efficiency Champion
Google's Gemma 3B serves as our primary model for most inference tasks, offering exceptional performance per dollar.
Key Characteristics:
- Model Size: 3 billion parameters
- Inference Speed: ~50-100ms response time
- Cost: Ultra-low ($0.01 input / $0.02 output per million tokens)
- Strengths: General conversation, content moderation, simple Q&A
- Use Cases: Real-time chat, message filtering, basic assistance
Technical Specifications:
{
"model": "google/gemma-3b",
"parameters": "3B",
"context_length": 8192,
"training_data": "Web data, books, code",
"optimizations": [
"Efficient attention mechanisms",
"Optimized for mobile/edge deployment",
"Low memory footprint"
],
"costs": {
"input_per_mtok": 0.01,
"output_per_mtok": 0.02
}
}
Llama 7B - Our Reasoning Powerhouse
Meta's Llama 7B handles more complex reasoning tasks requiring deeper understanding and analysis.
Key Characteristics:
- Model Size: 7 billion parameters
- Inference Speed: ~200-400ms response time
- Cost: Low ($0.02 input / $0.04 output per million tokens)
- Strengths: Complex reasoning, analysis, document understanding
- Use Cases: Legal document analysis, conflict resolution, detailed explanations
Technical Specifications:
{
"model": "meta-llama/llama-7b",
"parameters": "7B",
"context_length": 4096,
"training_data": "Diverse internet data, academic papers",
"optimizations": [
"Instruction-following fine-tuning",
"Enhanced reasoning capabilities",
"Better factual accuracy"
],
"costs": {
"input_per_mtok": 0.02,
"output_per_mtok": 0.04
}
}
Provider Architecture
Unified Provider Interface
Our inference system abstracts away provider differences through a standardized interface:
interface InferenceProvider {
name: string;
models: ModelCatalog;
// Core inference methods
generate(request: InferenceRequest): Promise<InferenceResponse>;
generateStream(request: InferenceRequest): AsyncIterator<InferenceChunk>;
// Utility methods
isHealthy(): Promise<boolean>;
getModelInfo(modelId: string): ModelInfo;
calculateCost(tokens: TokenUsage): number;
}
interface InferenceRequest {
model: string;
messages: ChatMessage[];
parameters?: {
temperature?: number;
max_tokens?: number;
top_p?: number;
frequency_penalty?: number;
};
metadata?: Record<string, any>;
}
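For illustration, a request built against this interface and sent through a single provider might look like the following sketch; the openRouterProvider instance and the example content are hypothetical, not part of the codebase:
// Hypothetical usage of the unified interface; `openRouterProvider` is assumed
// to be an InferenceProvider implementation configured elsewhere.
const request: InferenceRequest = {
  model: "google/gemma-3b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize today's schedule changes." }
  ],
  parameters: { temperature: 0.7, max_tokens: 300 }
};

const response = await openRouterProvider.generate(request);
console.log(response.content, openRouterProvider.calculateCost(response.usage));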
Supported Providers
| Provider | Primary Models | Strengths | Use Cases |
|---|---|---|---|
| OpenRouter | Gemma 3B, Llama 7B | Cost-effective, model diversity | Primary inference, experimentation |
| Ollama | Local models | Privacy, no API costs | Development, sensitive data |
| OpenAI | GPT-4o-mini | Reliability, advanced features | Fallback, premium features |
| Anthropic | Claude Haiku | Safety, instruction following | Content moderation, review |
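To illustrate how these providers plug into the unified interface, here is a minimal registry sketch with health-based selection; the ProviderRegistry class and its method names are illustrative assumptions rather than our published API:
// Illustrative registry: providers are tried in priority order, skipping any
// that currently report themselves unhealthy.
class ProviderRegistry {
  private providers = new Map<string, InferenceProvider>();

  register(provider: InferenceProvider): void {
    this.providers.set(provider.name, provider);
  }

  async pickProvider(priority: string[]): Promise<InferenceProvider> {
    for (const name of priority) {
      const provider = this.providers.get(name);
      if (provider && await provider.isHealthy()) {
        return provider;
      }
    }
    throw new Error("No healthy inference provider available");
  }
}

// Example: prefer OpenRouter, fall back to Ollama, then OpenAI.
// const provider = await registry.pickProvider(["openrouter", "ollama", "openai"]);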
Smart Model Selection
Our system automatically selects the optimal model based on request characteristics:
class InferenceRouter {
  selectModel(request: InferenceRequest): string {
    // Premium features -> RheaDuoBoost (checked first so the default paths below don't shadow it)
    if (request.metadata?.enhancement === 'premium') {
      return 'rheaduoboost';
    }
    const complexity = this.analyzeComplexity(request);
    const context = this.analyzeContext(request);
    // Simple queries -> Gemma 3B
    if (complexity === 'low' && context.length < 1000) {
      return 'openrouter/google/gemma-3b';
    }
    // Complex reasoning -> Llama 7B
    if (complexity === 'high' || context.includes('analysis')) {
      return 'openrouter/meta-llama/llama-7b';
    }
    // Default to the cost-effective option
    return 'openrouter/google/gemma-3b';
  }
}
Supabase Edge Functions Integration
Inference Edge Function
Our inference services are deployed as Supabase Edge Functions for seamless integration:
// supabase/functions/ai-inference/index.ts
import { serve } from "https://deno.land/std@0.168.0/http/server.ts";
import { InferenceService } from "./inference-service.ts";
serve(async (req) => {
try {
const { query, context, options = {} } = await req.json();
// Initialize inference service
const inference = new InferenceService();
// Smart model selection
const selectedModel = inference.selectOptimalModel(query, context, options);
// Generate response
const result = await inference.generate({
model: selectedModel,
messages: [
{ role: "system", content: context.systemPrompt || "You are a helpful assistant." },
{ role: "user", content: query }
],
parameters: {
temperature: options.temperature || 0.7,
max_tokens: options.maxTokens || 1000
}
});
return new Response(JSON.stringify({
response: result.content,
model: selectedModel,
usage: result.usage,
cost: result.cost
}), {
headers: { 'Content-Type': 'application/json' }
});
} catch (error) {
return new Response(JSON.stringify({
error: error.message,
fallback: "I apologize, but I'm experiencing technical difficulties."
}), {
status: 500,
headers: { 'Content-Type': 'application/json' }
});
}
});
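From application code, the function can be invoked with the standard supabase-js Functions client; a minimal sketch, assuming a `supabase` client created with createClient and a query/options shape matching the function above:
// Client-side call to the ai-inference Edge Function via supabase-js.
const { data, error } = await supabase.functions.invoke("ai-inference", {
  body: {
    query: "How should we split the school pickup schedule?",
    context: { systemPrompt: "You are a helpful co-parenting assistant." },
    options: { temperature: 0.5, maxTokens: 500 }
  }
});

if (error) {
  console.error("Inference failed:", error.message);
} else {
  console.log(data.response, data.model, data.cost);
}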
Streaming Inference
For real-time applications, we support streaming responses:
// supabase/functions/ai-stream/index.ts
serve(async (req) => {
const { query, context } = await req.json();
const stream = new ReadableStream({
async start(controller) {
const inference = new InferenceService();
for await (const chunk of inference.generateStream({
model: "openrouter/google/gemma-3b",
messages: [
{ role: "system", content: context.systemPrompt },
{ role: "user", content: query }
]
})) {
controller.enqueue(new TextEncoder().encode(
`data: ${JSON.stringify(chunk)}\n\n`
));
}
controller.close();
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
});
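On the client, the event stream can be read with the standard fetch and ReadableStream APIs; this is a minimal sketch that assumes the function is reachable at the usual /functions/v1/ai-stream path and that SUPABASE_URL, ANON_KEY, query, context, and renderToken are defined in the surrounding scope:
// Minimal client-side reader for the SSE stream emitted above.
const res = await fetch(`${SUPABASE_URL}/functions/v1/ai-stream`, {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${ANON_KEY}` },
  body: JSON.stringify({ query, context })
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each event arrives as a line of the form "data: {...}"
  for (const line of decoder.decode(value, { stream: true }).split("\n")) {
    if (line.startsWith("data: ")) {
      renderToken(JSON.parse(line.slice(6))); // hypothetical UI update helper
    }
  }
}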
Cost Optimization Strategy
Token Efficiency
Our models are selected specifically for their token efficiency:
Typical Request Costs:
Gemma 3B Request (500 input tokens, 200 output tokens):
- Input: 500 Ă— $0.01 Ă· 1,000,000 = $0.000005
- Output: 200 Ă— $0.02 Ă· 1,000,000 = $0.000004
- Total: $0.000009 (0.0009¢)
Llama 7B Request (1000 input tokens, 400 output tokens):
- Input: 1000 Ă— $0.02 Ă· 1,000,000 = $0.00002
- Output: 400 Ă— $0.04 Ă· 1,000,000 = $0.000016
- Total: $0.000036 (0.0036¢)
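The same arithmetic can be captured in a small helper; a minimal sketch, using the per-million-token prices from the model specifications above:
// Cost of a single request, given per-million-token prices.
interface ModelPricing { inputPerMtok: number; outputPerMtok: number; }

function requestCost(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  return (inputTokens * pricing.inputPerMtok + outputTokens * pricing.outputPerMtok) / 1_000_000;
}

// Gemma 3B example from above: 500 input + 200 output tokens
requestCost(500, 200, { inputPerMtok: 0.01, outputPerMtok: 0.02 }); // => 0.000009

// Llama 7B example: 1000 input + 400 output tokens
requestCost(1000, 400, { inputPerMtok: 0.02, outputPerMtok: 0.04 }); // => 0.000036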
Intelligent Caching
We implement multi-level caching to reduce inference costs:
interface CacheStrategy {
// Level 1: Response caching for identical queries
responseCache: Map<string, CachedResponse>;
// Level 2: Embedding cache for semantic similarity
embeddingCache: Map<string, number[]>;
// Level 3: Template cache for common patterns
templateCache: Map<string, ResponseTemplate>;
}
class InferenceCache {
async getCachedResponse(query: string): Promise<string | null> {
// Check exact match first
const exactMatch = await this.responseCache.get(query);
if (exactMatch && !this.isExpired(exactMatch)) {
return exactMatch.response;
}
// Check semantic similarity
const embedding = await this.getEmbedding(query);
const similarQueries = await this.findSimilar(embedding, 0.95);
if (similarQueries.length > 0) {
return similarQueries[0].response;
}
return null;
}
}
Cost Monitoring
Real-time cost tracking across all inference requests:
interface CostTracker {
trackRequest(request: InferenceRequest, response: InferenceResponse): void;
getDailyCosts(): DailyCostSummary;
getModelCosts(): ModelCostBreakdown;
alertOnThreshold(threshold: number): void;
}
// Daily cost breakdown
{
"date": "2025-01-01",
"total_cost": 2.47,
"models": {
"gemma-3b": {
"requests": 1247,
"cost": 0.85,
"avg_cost_per_request": 0.00068
},
"llama-7b": {
"requests": 324,
"cost": 1.23,
"avg_cost_per_request": 0.0038
},
"rheaduoboost": {
"requests": 89,
"cost": 0.39,
"avg_cost_per_request": 0.0044
}
}
}
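A minimal in-memory tracker along the lines of the CostTracker interface above might look like the sketch below; it covers only trackRequest and alertOnThreshold, and the accumulation logic is illustrative rather than our production implementation:
// Illustrative in-memory tracker keyed by model; the reporting methods of
// CostTracker (getDailyCosts, getModelCosts) are omitted for brevity.
class SimpleCostTracker {
  private totals = new Map<string, { requests: number; cost: number }>();
  private threshold = Infinity;

  trackRequest(request: InferenceRequest, response: InferenceResponse): void {
    const entry = this.totals.get(request.model) ?? { requests: 0, cost: 0 };
    entry.requests += 1;
    entry.cost += response.cost;
    this.totals.set(request.model, entry);
    if (this.totalCost() > this.threshold) {
      console.warn(`Inference spend exceeded $${this.threshold}`);
    }
  }

  alertOnThreshold(threshold: number): void {
    this.threshold = threshold;
  }

  private totalCost(): number {
    let sum = 0;
    for (const { cost } of this.totals.values()) sum += cost;
    return sum;
  }
}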
Quality Assurance
Model Performance Monitoring
Continuous monitoring of model performance across key metrics:
interface PerformanceMetrics {
responseTime: number; // Average response time in ms
errorRate: number; // Percentage of failed requests
userSatisfaction: number; // User rating (1-5)
contextAccuracy: number; // Relevance to conversation context
safetyScore: number; // Content safety rating
}
class QualityMonitor {
async evaluateResponse(
query: string,
response: string,
context: ConversationContext
): Promise<PerformanceMetrics> {
return {
responseTime: this.measureLatency(),
errorRate: this.calculateErrorRate(),
userSatisfaction: await this.getUserFeedback(),
contextAccuracy: await this.evaluateContext(query, response, context),
safetyScore: await this.checkSafety(response)
};
}
}
A/B Testing Framework
Continuous model comparison to optimize performance:
class InferenceABTest {
async runComparison(
modelA: string,
modelB: string,
testQueries: string[]
): Promise<ABTestResults> {
const results = await Promise.all(
testQueries.map(async (query) => {
const [responseA, responseB] = await Promise.all([
        this.inference.generate({ model: modelA, messages: [{ role: 'user', content: query }] }),
        this.inference.generate({ model: modelB, messages: [{ role: 'user', content: query }] })
]);
return {
query,
modelA: { response: responseA, cost: this.calculateCost(responseA) },
modelB: { response: responseB, cost: this.calculateCost(responseB) },
winner: await this.evaluateQuality(responseA, responseB)
};
})
);
return this.summarizeResults(results);
}
}
RheaDuoBoost Integration
While our primary inference focuses on cost-effective single-model requests, RheaDuoBoost provides enhanced quality for premium features through its 4-phase orchestration workflow. This system combines multiple models to achieve premium results at significantly lower cost than using GPT-4 directly.
When RheaDuoBoost is Used:
- Complex co-parenting advice requiring nuanced understanding
- Document analysis requiring multiple perspectives
- Conflict resolution requiring careful consideration
- Premium subscription features
Integration Example:
// Enhanced inference for premium features
if (request.metadata?.premium === true) {
const rheaResult = await rheaDuoBoost.process({
query: request.query,
primaryModel: "openrouter/meta-llama/llama-7b",
reviewerModel: "openrouter/google/gemma-3b"
});
return {
response: rheaResult.final_response,
enhanced: true,
cost: rheaResult.metadata.total_cost,
phases: rheaResult.metadata.stages.length
};
}
Error Handling and Resilience
Graceful Degradation
Multi-level fallback strategy ensures service availability:
class ResilientInference {
async generateWithFallback(request: InferenceRequest): Promise<InferenceResponse> {
const fallbackChain = [
'openrouter/google/gemma-3b', // Primary choice
'openrouter/meta-llama/llama-7b', // Fallback 1
'openai/gpt-4o-mini', // Fallback 2
'ollama/local-model' // Emergency fallback
];
for (const model of fallbackChain) {
try {
return await this.inference.generate({
...request,
model: model
});
} catch (error) {
console.warn(`Model ${model} failed: ${error.message}`);
if (model === fallbackChain[fallbackChain.length - 1]) {
// Last resort: return cached response or graceful message
return this.getEmergencyResponse(request);
}
}
    }
    // Should be unreachable given the last-resort branch above, but ensures every code path returns a response
    return this.getEmergencyResponse(request);
  }
private getEmergencyResponse(request: InferenceRequest): InferenceResponse {
return {
content: "I apologize, but I'm experiencing technical difficulties. Please try again in a moment.",
model: "emergency-fallback",
usage: { input_tokens: 0, output_tokens: 0 },
cost: 0
};
}
}
Health Monitoring
Continuous health checks ensure optimal performance:
class HealthMonitor {
async checkProviderHealth(): Promise<ProviderHealthStatus> {
const providers = ['openrouter', 'ollama', 'openai', 'anthropic'];
const healthChecks = await Promise.allSettled(
providers.map(async (provider) => {
const start = Date.now();
await this.pingProvider(provider);
const latency = Date.now() - start;
return {
provider,
status: 'healthy',
latency,
timestamp: new Date().toISOString()
};
})
);
return this.aggregateHealthStatus(healthChecks);
}
}
Future Enhancements
Planned Improvements
- Model Quantization: 4-bit and 8-bit quantized models for even lower costs
- Edge Deployment: Local inference for private data processing
- Custom Fine-tuning: Domain-specific models for co-parenting scenarios
- Multimodal Support: Image and document understanding capabilities
- Real-time Learning: Adaptive models that improve with user feedback
Scaling Strategy
Our inference services provide the reliable, cost-effective foundation that enables OurOtters to democratize access to AI-powered co-parenting assistance. By focusing on efficient open source models and intelligent orchestration, we deliver premium experiences at accessible costs.