Inference Services
Our inference services provide the foundation for all AI-powered features in OurOtters. Built on a flexible provider system with intelligent orchestration, they prioritize cost-effectiveness and performance while maintaining enterprise-grade reliability.
Overview
Our inference architecture supports multiple AI providers through a unified interface, with intelligent routing and fallback mechanisms. While we leverage our proprietary RheaDuoBoost orchestration system for enhanced quality, our core inference strategy focuses on cost-effective open source models, particularly Gemma 3B and Llama 7B.
Primary Inference Models
Gemma 3B - Our Efficiency Champion
Google's Gemma 3B serves as our primary model for most inference tasks, offering exceptional performance per dollar.
Key Characteristics:
- Model Size: 3 billion parameters
- Inference Speed: ~50-100ms response time
- Cost: Ultra-low ($0.01 input / $0.02 output per million tokens)
- Strengths: General conversation, content moderation, simple Q&A
- Use Cases: Real-time chat, message filtering, basic assistance
Technical Specifications:
{
"model": "google/gemma-3b",
"parameters": "3B",
"context_length": 8192,
"training_data": "Web data, books, code",
"optimizations": [
"Efficient attention mechanisms",
"Optimized for mobile/edge deployment",
"Low memory footprint"
],
"costs": {
"input_per_mtok": 0.01,
"output_per_mtok": 0.02
}
}
Llama 7B - Our Reasoning Powerhouse
Meta's Llama 7B handles more complex reasoning tasks requiring deeper understanding and analysis.
Key Characteristics:
- Model Size: 7 billion parameters
- Inference Speed: ~200-400ms response time
- Cost: Low ($0.02 input / $0.04 output per million tokens)
- Strengths: Complex reasoning, analysis, document understanding
- Use Cases: Legal document analysis, conflict resolution, detailed explanations
Technical Specifications:
{
"model": "meta-llama/llama-7b",
"parameters": "7B",
"context_length": 4096,
"training_data": "Diverse internet data, academic papers",
"optimizations": [
"Instruction-following fine-tuning",
"Enhanced reasoning capabilities",
"Better factual accuracy"
],
"costs": {
"input_per_mtok": 0.02,
"output_per_mtok": 0.04
}
}
Provider Architecture
Unified Provider Interface
Our inference system abstracts away provider differences through a standardized interface:
interface InferenceProvider {
name: string;
models: ModelCatalog;
// Core inference methods
generate(request: InferenceRequest): Promise<InferenceResponse>;
generateStream(request: InferenceRequest): AsyncIterator<InferenceChunk>;
// Utility methods
isHealthy(): Promise<boolean>;
getModelInfo(modelId: string): ModelInfo;
calculateCost(tokens: TokenUsage): number;
}
interface InferenceRequest {
model: string;
messages: ChatMessage[];
parameters?: {
temperature?: number;
max_tokens?: number;
top_p?: number;
frequency_penalty?: number;
};
metadata?: Record<string, any>;
}
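For illustration, a request built against this interface and sent through a single provider might look like the following sketch; the openRouterProvider instance and the example content are hypothetical, not part of the codebase:
// Hypothetical usage of the unified interface; `openRouterProvider` is assumed
// to be an InferenceProvider implementation configured elsewhere.
const request: InferenceRequest = {
  model: "google/gemma-3b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize today's schedule changes." }
  ],
  parameters: { temperature: 0.7, max_tokens: 300 }
};

const response = await openRouterProvider.generate(request);
console.log(response.content, openRouterProvider.calculateCost(response.usage));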
Supported Providers
| Provider | Primary Models | Strengths | Use Cases |
|---|---|---|---|
| OpenRouter | Gemma 3B, Llama 7B | Cost-effective, model diversity | Primary inference, experimentation |
| Ollama | Local models | Privacy, no API costs | Development, sensitive data |
| OpenAI | GPT-4o-mini | Reliability, advanced features | Fallback, premium features |
| Anthropic | Claude Haiku | Safety, instruction following | Content moderation, review |
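To illustrate how these providers plug into the unified interface, here is a minimal registry sketch with health-based selection; the ProviderRegistry class and its method names are illustrative assumptions rather than our published API:
// Illustrative registry: providers are tried in priority order, skipping any
// that currently report themselves unhealthy.
class ProviderRegistry {
  private providers = new Map<string, InferenceProvider>();

  register(provider: InferenceProvider): void {
    this.providers.set(provider.name, provider);
  }

  async pickProvider(priority: string[]): Promise<InferenceProvider> {
    for (const name of priority) {
      const provider = this.providers.get(name);
      if (provider && await provider.isHealthy()) {
        return provider;
      }
    }
    throw new Error("No healthy inference provider available");
  }
}

// Example: prefer OpenRouter, fall back to Ollama, then OpenAI.
// const provider = await registry.pickProvider(["openrouter", "ollama", "openai"]);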
Smart Model Selection
Our system automatically selects the optimal model based on request characteristics:
class InferenceRouter {
  selectModel(request: InferenceRequest): string {
    // Premium features -> RheaDuoBoost (checked first so the default paths below don't shadow it)
    if (request.metadata?.enhancement === 'premium') {
      return 'rheaduoboost';
    }
    const complexity = this.analyzeComplexity(request);
    const context = this.analyzeContext(request);
    // Simple queries -> Gemma 3B
    if (complexity === 'low' && context.length < 1000) {
      return 'openrouter/google/gemma-3b';
    }
    // Complex reasoning -> Llama 7B
    if (complexity === 'high' || context.includes('analysis')) {
      return 'openrouter/meta-llama/llama-7b';
    }
    // Default to the cost-effective option
    return 'openrouter/google/gemma-3b';
  }
}
Supabase Edge Functions Integration
Inference Edge Function
Our inference services are deployed as Supabase Edge Functions for seamless integration:
// supabase/functions/ai-inference/index.ts
import { serve } from "https://deno.land/std@0.168.0/http/server.ts";
import { InferenceService } from "./inference-service.ts";
serve(async (req) => {
try {
const { query, context, options = {} } = await req.json();
// Initialize inference service
const inference = new InferenceService();
// Smart model selection
const selectedModel = inference.selectOptimalModel(query, context, options);
// Generate response
const result = await inference.generate({
model: selectedModel,
messages: [
{ role: "system", content: context.systemPrompt || "You are a helpful assistant." },
{ role: "user", content: query }
],
parameters: {
temperature: options.temperature || 0.7,
max_tokens: options.maxTokens || 1000
}
});
return new Response(JSON.stringify({
response: result.content,
model: selectedModel,
usage: result.usage,
cost: result.cost
}), {
headers: { 'Content-Type': 'application/json' }
});
} catch (error) {
return new Response(JSON.stringify({
error: error.message,
fallback: "I apologize, but I'm experiencing technical difficulties."
}), {
status: 500,
headers: { 'Content-Type': 'application/json' }
});
}
});
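From application code, the function can be invoked with the standard supabase-js Functions client; a minimal sketch, assuming a `supabase` client created with createClient and a query/options shape matching the function above:
// Client-side call to the ai-inference Edge Function via supabase-js.
const { data, error } = await supabase.functions.invoke("ai-inference", {
  body: {
    query: "How should we split the school pickup schedule?",
    context: { systemPrompt: "You are a helpful co-parenting assistant." },
    options: { temperature: 0.5, maxTokens: 500 }
  }
});

if (error) {
  console.error("Inference failed:", error.message);
} else {
  console.log(data.response, data.model, data.cost);
}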
Streaming Inference
For real-time applications, we support streaming responses:
// supabase/functions/ai-stream/index.ts
serve(async (req) => {
const { query, context } = await req.json();
const stream = new ReadableStream({
async start(controller) {
const inference = new InferenceService();
for await (const chunk of inference.generateStream({
model: "openrouter/google/gemma-3b",
messages: [
{ role: "system", content: context.systemPrompt },
{ role: "user", content: query }
]
})) {
controller.enqueue(new TextEncoder().encode(
`data: ${JSON.stringify(chunk)}\n\n`
));
}
controller.close();
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
});
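On the client, the event stream can be read with the standard fetch and ReadableStream APIs; this is a minimal sketch that assumes the function is reachable at the usual /functions/v1/ai-stream path and that SUPABASE_URL, ANON_KEY, query, context, and renderToken are defined in the surrounding scope:
// Minimal client-side reader for the SSE stream emitted above.
const res = await fetch(`${SUPABASE_URL}/functions/v1/ai-stream`, {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${ANON_KEY}` },
  body: JSON.stringify({ query, context })
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each event arrives as a line of the form "data: {...}"
  for (const line of decoder.decode(value, { stream: true }).split("\n")) {
    if (line.startsWith("data: ")) {
      renderToken(JSON.parse(line.slice(6))); // hypothetical UI update helper
    }
  }
}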
Cost Optimization Strategy
Token Efficiency
Our models are selected specifically for their token efficiency:
Typical Request Costs:
Gemma 3B Request (500 input tokens, 200 output tokens):
- Input: 500 Ă— $0.01 Ă· 1,000,000 = $0.000005
- Output: 200 Ă— $0.02 Ă· 1,000,000 = $0.000004
- Total: $0.000009 (0.0009¢)
Llama 7B Request (1000 input tokens, 400 output tokens):
- Input: 1000 Ă— $0.02 Ă· 1,000,000 = $0.00002
- Output: 400 Ă— $0.04 Ă· 1,000,000 = $0.000016
- Total: $0.000036 (0.0036¢)
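The same arithmetic can be captured in a small helper; a minimal sketch, using the per-million-token prices from the model specifications above:
// Cost of a single request, given per-million-token prices.
interface ModelPricing { inputPerMtok: number; outputPerMtok: number; }

function requestCost(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  return (inputTokens * pricing.inputPerMtok + outputTokens * pricing.outputPerMtok) / 1_000_000;
}

// Gemma 3B example from above: 500 input + 200 output tokens
requestCost(500, 200, { inputPerMtok: 0.01, outputPerMtok: 0.02 }); // => 0.000009

// Llama 7B example: 1000 input + 400 output tokens
requestCost(1000, 400, { inputPerMtok: 0.02, outputPerMtok: 0.04 }); // => 0.000036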
Intelligent Caching
We implement multi-level caching to reduce inference costs:
interface CacheStrategy {
// Level 1: Response caching for identical queries
responseCache: Map<string, CachedResponse>;
// Level 2: Embedding cache for semantic similarity
embeddingCache: Map<string, number[]>;
// Level 3: Template cache for common patterns
templateCache: Map<string, ResponseTemplate>;
}
class InferenceCache {
async getCachedResponse(query: string): Promise<string | null> {
// Check exact match first
const exactMatch = await this.responseCache.get(query);
if (exactMatch && !this.isExpired(exactMatch)) {
return exactMatch.response;
}
// Check semantic similarity
const embedding = await this.getEmbedding(query);
const similarQueries = await this.findSimilar(embedding, 0.95);
if (similarQueries.length > 0) {
return similarQueries[0].response;
}
return null;
}
}
Cost Monitoring
Real-time cost tracking across all inference requests:
interface CostTracker {
trackRequest(request: InferenceRequest, response: InferenceResponse): void;
getDailyCosts(): DailyCostSummary;
getModelCosts(): ModelCostBreakdown;
alertOnThreshold(threshold: number): void;
}
// Daily cost breakdown
{
"date": "2025-01-01",
"total_cost": 2.47,
"models": {
"gemma-3b": {
"requests": 1247,
"cost": 0.85,
"avg_cost_per_request": 0.00068
},
"llama-7b": {
"requests": 324,
"cost": 1.23,
"avg_cost_per_request": 0.0038
},
"rheaduoboost": {
"requests": 89,
"cost": 0.39,
"avg_cost_per_request": 0.0044
}
}
}
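A minimal in-memory tracker along the lines of the CostTracker interface above might look like the sketch below; it covers only trackRequest and alertOnThreshold, and the accumulation logic is illustrative rather than our production implementation:
// Illustrative in-memory tracker keyed by model; the reporting methods of
// CostTracker (getDailyCosts, getModelCosts) are omitted for brevity.
class SimpleCostTracker {
  private totals = new Map<string, { requests: number; cost: number }>();
  private threshold = Infinity;

  trackRequest(request: InferenceRequest, response: InferenceResponse): void {
    const entry = this.totals.get(request.model) ?? { requests: 0, cost: 0 };
    entry.requests += 1;
    entry.cost += response.cost;
    this.totals.set(request.model, entry);
    if (this.totalCost() > this.threshold) {
      console.warn(`Inference spend exceeded $${this.threshold}`);
    }
  }

  alertOnThreshold(threshold: number): void {
    this.threshold = threshold;
  }

  private totalCost(): number {
    let sum = 0;
    for (const { cost } of this.totals.values()) sum += cost;
    return sum;
  }
}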
Quality Assurance
Model Performance Monitoring
Continuous monitoring of model performance across key metrics:
interface PerformanceMetrics {
responseTime: number; // Average response time in ms
errorRate: number; // Percentage of failed requests
userSatisfaction: number; // User rating (1-5)
contextAccuracy: number; // Relevance to conversation context
safetyScore: number; // Content safety rating
}
class QualityMonitor {
async evaluateResponse(
query: string,
response: string,
context: ConversationContext
): Promise<PerformanceMetrics> {
return {
responseTime: this.measureLatency(),
errorRate: this.calculateErrorRate(),
userSatisfaction: await this.getUserFeedback(),
contextAccuracy: await this.evaluateContext(query, response, context),
safetyScore: await this.checkSafety(response)
};
}
}
A/B Testing Framework
Continuous model comparison to optimize performance:
class InferenceABTest {
async runComparison(
modelA: string,
modelB: string,
testQueries: string[]
): Promise<ABTestResults> {
const results = await Promise.all(
testQueries.map(async (query) => {
const [responseA, responseB] = await Promise.all([
        this.inference.generate({ model: modelA, messages: [{ role: 'user', content: query }] }),
        this.inference.generate({ model: modelB, messages: [{ role: 'user', content: query }] })
]);
return {
query,
modelA: { response: responseA, cost: this.calculateCost(responseA) },
modelB: { response: responseB, cost: this.calculateCost(responseB) },
winner: await this.evaluateQuality(responseA, responseB)
};
})
);
return this.summarizeResults(results);
}
}
RheaDuoBoost Integration
While our primary inference focuses on cost-effective single-model requests, RheaDuoBoost provides enhanced quality for premium features through its 4-phase orchestration workflow. This system combines multiple models to achieve premium results at significantly lower cost than using GPT-4 directly.
When RheaDuoBoost is Used:
- Complex co-parenting advice requiring nuanced understanding
- Document analysis requiring multiple perspectives
- Conflict resolution requiring careful consideration
- Premium subscription features
Integration Example:
// Enhanced inference for premium features
if (request.metadata?.premium === true) {
const rheaResult = await rheaDuoBoost.process({
query: request.query,
primaryModel: "openrouter/meta-llama/llama-7b",
reviewerModel: "openrouter/google/gemma-3b"
});
return {
response: rheaResult.final_response,
enhanced: true,
cost: rheaResult.metadata.total_cost,
phases: rheaResult.metadata.stages.length
};
}
Error Handling and Resilience
Graceful Degradation
Multi-level fallback strategy ensures service availability:
class ResilientInference {
async generateWithFallback(request: InferenceRequest): Promise<InferenceResponse> {
const fallbackChain = [
'openrouter/google/gemma-3b', // Primary choice
'openrouter/meta-llama/llama-7b', // Fallback 1
'openai/gpt-4o-mini', // Fallback 2
'ollama/local-model' // Emergency fallback
];
for (const model of fallbackChain) {
try {
return await this.inference.generate({
...request,
model: model
});
} catch (error) {
console.warn(`Model ${model} failed: ${error.message}`);
if (model === fallbackChain[fallbackChain.length - 1]) {
// Last resort: return cached response or graceful message
return this.getEmergencyResponse(request);
}
}
    }
    // Should be unreachable given the last-resort branch above, but ensures every code path returns a response
    return this.getEmergencyResponse(request);
  }
private getEmergencyResponse(request: InferenceRequest): InferenceResponse {
return {
content: "I apologize, but I'm experiencing technical difficulties. Please try again in a moment.",
model: "emergency-fallback",
usage: { input_tokens: 0, output_tokens: 0 },
cost: 0
};
}
}
Health Monitoring
Continuous health checks ensure optimal performance:
class HealthMonitor {
async checkProviderHealth(): Promise<ProviderHealthStatus> {
const providers = ['openrouter', 'ollama', 'openai', 'anthropic'];
const healthChecks = await Promise.allSettled(
providers.map(async (provider) => {
const start = Date.now();
await this.pingProvider(provider);
const latency = Date.now() - start;
return {
provider,
status: 'healthy',
latency,
timestamp: new Date().toISOString()
};
})
);
return this.aggregateHealthStatus(healthChecks);
}
}
Future Enhancements
Planned Improvements
- Model Quantization: 4-bit and 8-bit quantized models for even lower costs
- Edge Deployment: Local inference for private data processing
- Custom Fine-tuning: Domain-specific models for co-parenting scenarios
- Multimodal Support: Image and document understanding capabilities
- Real-time Learning: Adaptive models that improve with user feedback
Scaling Strategy
Our inference services provide the reliable, cost-effective foundation that enables OurOtters to democratize access to AI-powered co-parenting assistance. By focusing on efficient open source models and intelligent orchestration, we deliver premium experiences at accessible costs.