What we're testing right now
Real experiments with real results. We run structured tests on model performance, agent behavior, memory retrieval, and training methods — then share what we find.
Running, completed, and planned
Multi-Agent Task Routing
Testing dynamic task distribution across specialized agents. Instead of one large model doing everything, we route subtasks to purpose-built micromodels and measure accuracy, latency, and cost.
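The routing idea can be sketched as a simple dispatcher keyed by subtask type. Everything here is illustrative, a minimal stand-in for real micromodel calls, not production code:

```python
# Minimal sketch of routing subtasks to specialized "micromodels".
# The handlers are toy stand-ins for real model calls.

def summarize(text: str) -> str:
    # placeholder for a small summarization model
    return text[:40] + "..."

def classify(text: str) -> str:
    # placeholder for a small sentiment classifier
    return "positive" if "good" in text.lower() else "negative"

ROUTES = {
    "summarize": summarize,
    "classify": classify,
}

def route(task_type: str, payload: str) -> str:
    """Dispatch a subtask to its purpose-built handler."""
    handler = ROUTES.get(task_type)
    if handler is None:
        raise ValueError(f"no specialist registered for {task_type!r}")
    return handler(payload)
```

The registry pattern is what makes the accuracy/latency/cost comparison clean: each specialist can be swapped or measured in isolation.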
Distillation Efficiency Benchmarks
Measuring how much capability a 1B-parameter student model retains after distilling from a 70B teacher. Tracking performance across reasoning, summarization, and classification tasks.
DomeAI Retrieval Latency
Benchmarking memory retrieval speeds in DomeAI as context stores scale from 10k to 10M entries. Optimizing vector search and cache strategies for sub-100ms recall.
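A toy version of the benchmark loop, using brute-force cosine search over a small in-memory store (a real system at 10M entries would use an ANN index and caching; store size and dimensions here are illustrative):

```python
import math
import random
import time

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, store, k=5):
    """Brute-force nearest neighbors; an ANN index replaces this at scale."""
    scored = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [key for key, _ in scored[:k]]

# Tiny synthetic store: 1,000 entries of 16-dim vectors.
random.seed(0)
store = {i: [random.random() for _ in range(16)] for i in range(1000)}
query = store[42]

start = time.perf_counter()
hits = top_k(query, store)
elapsed_ms = (time.perf_counter() - start) * 1000
```

Timing the recall path end to end, rather than just the similarity math, is what surfaces cache and index effects as the store grows.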
Long-Context Faithfulness
Tested how faithfully models follow instructions placed at varying positions within 128k-token contexts. Results informed our prompt construction patterns for production agents.
Synthetic Data Quality Scoring
Designing a scoring rubric for synthetic training data. The goal is an automated pipeline that rates generated examples on accuracy, diversity, and difficulty before they enter training sets.
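One way such a gate could look: a weighted rubric score with an acceptance threshold. The weights and threshold below are purely illustrative, not tuned values from the pipeline:

```python
# Sketch of a rubric-based gate for synthetic training examples.
# Weights and threshold are illustrative placeholders.

WEIGHTS = {"accuracy": 0.5, "diversity": 0.3, "difficulty": 0.2}

def rubric_score(scores: dict) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def admit(example_scores: dict, threshold: float = 0.7) -> bool:
    """Only examples at or above the threshold enter the training set."""
    return rubric_score(example_scores) >= threshold
```

Scoring each dimension separately also makes failure analysis possible: a rejected batch can be traced to low diversity versus low accuracy.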
Agent Self-Correction Loops
Running agents with a "reflection" step after each action. Comparing task completion rates and error recovery between agents with and without self-correction capabilities.
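The control flow being compared looks roughly like this act-reflect-retry loop, where `act` and `critique` are stand-ins for model calls (toy functions below, not the actual agent):

```python
# Sketch of an act -> reflect -> retry loop. The control flow is the
# point; act and critique stand in for real model calls.

def run_with_reflection(task, act, critique, max_retries=2):
    """Attempt a task, letting a reflection step trigger retries."""
    result = act(task, feedback=None)
    for _ in range(max_retries):
        feedback = critique(task, result)
        if feedback is None:            # reflection found no problem
            return result
        result = act(task, feedback=feedback)  # retry with the critique

    return result

# Illustrative usage with toy act/critique functions.
def toy_act(task, feedback):
    return task.upper() if feedback else task

def toy_critique(task, result):
    return None if result.isupper() else "rewrite in uppercase"

fixed = run_with_reflection("fix me", toy_act, toy_critique)
```

Disabling the loop (`max_retries=0`) gives the no-reflection baseline, so both arms of the comparison run through the same harness.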
How we run experiments
Every experiment follows a structured process so results are reproducible and actionable — not just interesting.
Hypothesis
Start with a clear, testable question. What are we trying to prove or disprove?
Design
Define metrics, control variables, dataset size, and success criteria before running anything.
Execute
Run the experiment with logging at every step. Capture raw data, not just conclusions.
Analyze & Share
Publish findings internally (and sometimes publicly). Negative results are documented too.
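The four steps above can be captured as a structured record so a run stays reproducible end to end. This is a generic sketch of the idea, not our internal tooling:

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    hypothesis: str                 # the clear, testable question
    metrics: list                   # defined before running anything
    success_criteria: str
    raw_logs: list = field(default_factory=list)
    conclusion: str = ""            # filled in at the analyze stage

    def log(self, step: str, **data):
        # capture raw data at every step, not just conclusions
        self.raw_logs.append({"step": step, **data})

exp = Experiment(
    hypothesis="Self-correction raises multi-step task completion",
    metrics=["completion_rate", "recovery_rate"],
    success_criteria="completion_rate improves by >= 5 points",
)
exp.log("execute", completion_rate=0.74)
```

Because the hypothesis, metrics, and success criteria are fields set at construction time, they cannot be quietly revised after the results come in.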
Notable findings so far
Distilled models retain 87% accuracy on classification
Our early distillation experiments showed that a 1.3B-parameter student retains 87% of a 70B teacher's accuracy on binary classification tasks after targeted fine-tuning, at roughly 40x lower inference cost.

Prompt position matters more than length
In long-context faithfulness testing, instruction placement at the beginning and end of context windows yielded 23% higher compliance than mid-context placement, regardless of total length.
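The pattern this finding suggests is simple to encode: place the instructions before and after the long context rather than in the middle. A hypothetical template helper (the delimiters and wording are illustrative):

```python
# Sketch of the "instructions at both ends" prompt pattern suggested
# by the faithfulness results. Template wording is illustrative.

def build_prompt(instructions: str, context: str) -> str:
    """Repeat instructions before and after long context."""
    return (
        f"{instructions}\n\n"
        f"--- context start ---\n{context}\n--- context end ---\n\n"
        f"Reminder: {instructions}"
    )
```

The duplication costs a few extra tokens but, per the result above, recovers compliance that mid-context placement loses.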
Self-correcting agents complete 31% more tasks
Preliminary results from our reflection experiment show agents with a self-correction step complete significantly more multi-step tasks without human intervention.
Common questions about our experiments
Have an experiment idea?
We're always looking for interesting problems to study. If you have a hypothesis or challenge that could benefit from rigorous testing, let's talk.