AI Research Without a Microscope — Structured Thoughts

Everyone is building AI. Almost nobody can systematically study what AI does.

OpenAI filed for its IPO last week. In the filing, they state their goal: build “automated AI researchers.” Sam Altman has said that by March 2028, “a large portion of research may be carried out collaboratively by AI systems and human researchers.”

Researchers studying AI itself should pay attention. Not to the automation. To what is missing.

The missing instrument

When a biologist wants to understand cellular behavior, they use a microscope. When a physicist wants to study particle interactions, they use a detector. The instrument does not just magnify what you already see. It reveals structure you could not observe without it.

AI research has no equivalent instrument.

When a researcher wants to understand how GPT-4 responds to adversarial prompts, they write a Python script. When they want to compare reasoning strategies across models, they build a custom harness. When they want to reproduce someone else’s experiment, they discover the model version changed, the prompts were paraphrased from memory, the temperature was not recorded, and the intermediate outputs were discarded.

This is not a tooling inconvenience. It is a methodological crisis. The systems we are trying to study are probabilistic, opaque, and constantly changing. The tools we use to study them are ad hoc scripts that record only what we remembered to log. We are doing AI research the way naturalists studied living tissue before the microscope: observing outcomes and guessing at mechanisms.

What a research instrument for AI would need

Five properties, none of which current tools provide together:

Automatic recording. Every model interaction recorded without configuration. Not “logging you can enable.” Recording that is structural. The prompt, the response, the model version, the token counts, the latency, the cost. Every time. No exceptions.
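As a rough illustration of what "structural" recording means, here is a minimal Python sketch in which the ledger write is part of the call path rather than an optional logger. Everything in it is an assumption made for illustration: RecordingClient, ModelReply, fake_model, and the interactions table are hypothetical names, not mashin's API, and cost is left out to keep it short.

```python
import sqlite3
import time
from dataclasses import dataclass


@dataclass
class ModelReply:
    """What a provider call returns in this sketch (hypothetical shape)."""
    text: str
    model_version: str
    input_tokens: int
    output_tokens: int


class RecordingClient:
    """Wraps any callable model so that every call is written to a ledger."""

    def __init__(self, call_model, db_path=":memory:"):
        self._call_model = call_model
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS interactions (
                   id INTEGER PRIMARY KEY,
                   prompt TEXT, response TEXT, model_version TEXT,
                   input_tokens INTEGER, output_tokens INTEGER,
                   latency_ms REAL)"""
        )

    def ask(self, prompt):
        start = time.perf_counter()
        reply = self._call_model(prompt)
        latency_ms = (time.perf_counter() - start) * 1000.0
        # The write is part of the call path: there is no flag to turn it off.
        self.db.execute(
            "INSERT INTO interactions "
            "(prompt, response, model_version, input_tokens, output_tokens, latency_ms) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (prompt, reply.text, reply.model_version,
             reply.input_tokens, reply.output_tokens, latency_ms),
        )
        self.db.commit()
        return reply.text


def fake_model(prompt):
    """Stand-in for a real provider call, so the sketch runs offline."""
    words = len(prompt.split())
    return ModelReply(text=prompt.upper(), model_version="fake-2024-01",
                      input_tokens=words, output_tokens=words)


client = RecordingClient(fake_model)
client.ask("break this task into steps")
client.ask("now solve it")
rows = client.db.execute(
    "SELECT prompt, model_version, input_tokens FROM interactions ORDER BY id"
).fetchall()
```

The caller never decides whether to log. The only way to reach the model is through ask, and ask always writes a row.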

Model agnosticism. Research that studies AI behavior cannot be locked to one provider. You need to run the same experiment on Claude, GPT-4, Gemini, Llama, DeepSeek, and local models with the same protocol and the same trace format. Switching models should be a parameter change, not a rewrite.
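One way to picture "a parameter change, not a rewrite" is a dispatcher keyed on a provider prefix. The Python sketch below assumes a "provider:model" naming scheme like the one used in the example later in this post. The adapter functions are hypothetical stubs that return canned strings, not real SDK calls.

```python
from typing import Callable, Dict


# Hypothetical provider adapters. Real ones would call each vendor's SDK.
def _anthropic_adapter(model: str, prompt: str) -> str:
    return f"[anthropic/{model}] {prompt}"


def _openai_adapter(model: str, prompt: str) -> str:
    return f"[openai/{model}] {prompt}"


ADAPTERS: Dict[str, Callable[[str, str], str]] = {
    "anthropic": _anthropic_adapter,
    "openai": _openai_adapter,
}


def run(model_name: str, prompt: str) -> dict:
    """Dispatch on a 'provider:model' string and return one uniform trace record."""
    provider, sep, model = model_name.partition(":")
    if not sep or provider not in ADAPTERS:
        raise ValueError(f"unknown model name: {model_name!r}")
    response = ADAPTERS[provider](model, prompt)
    return {"provider": provider, "model": model,
            "prompt": prompt, "response": response}


# Same protocol, two providers: only the model_name argument changes.
traces = [run(name, "What is 17 * 3?")
          for name in ("anthropic:claude-sonnet", "openai:gpt-4o")]
```

Every trace has the same keys regardless of provider, which is what makes the later analysis provider-independent.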

Executable protocols. The experiment description and the experiment execution should be the same artifact. Not a paper that describes what was done and a script that approximates it. One thing that is both readable and runnable.

Governed execution. The protocol should be enforceable. If the experiment says “use temperature 0.7 and max_tokens 1000,” those parameters should be used. If someone modifies the protocol mid-experiment, the modification should be recorded. Parameter tweaks and protocol changes should be structurally impossible to make without leaving an auditable record, so that selective reporting becomes detectable.
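A minimal sketch of that idea in Python, under stated assumptions: the protocol is an immutable value, model calls can only read parameters from it, and the only way to change it is an amend call that appends to an audit log. Protocol, GovernedRun, and amend are hypothetical names chosen for this illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import List


@dataclass(frozen=True)
class Protocol:
    temperature: float
    max_tokens: int

    def fingerprint(self) -> str:
        """Stable hash of the parameters, so any change is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class GovernedRun:
    """Holds the active protocol and an append-only log of amendments."""

    def __init__(self, protocol: Protocol):
        self.protocol = protocol
        self.audit_log: List[dict] = []

    def amend(self, reason: str, **changes) -> None:
        old = self.protocol
        new = replace(old, **changes)  # builds a new frozen Protocol
        self.audit_log.append({
            "reason": reason,
            "before": old.fingerprint(),
            "after": new.fingerprint(),
            "changes": changes,
        })
        self.protocol = new

    def call_params(self) -> dict:
        """The only source of sampling parameters for a model call."""
        return asdict(self.protocol)


governed = GovernedRun(Protocol(temperature=0.7, max_tokens=1000))
governed.amend("responses were truncated", max_tokens=2000)
```

Python cannot make this truly tamper-proof, since a determined user can still reassign attributes. The sketch shows the shape of the guarantee, not its enforcement.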

Queryable data. The output of an experiment should not be a pile of JSON files. It should be structured, queryable data that can be analyzed with standard tools. How did reasoning token usage vary across models? What was the average latency per step type? Which prompts produced the most divergent responses? These questions should be SQL queries on the experimental record, not custom parsing scripts.

What this looks like in practice

A cross-model comparison study. The research question: how do different models handle multi-step reasoning tasks?

machine reasoning_comparison

  has description "Compare multi-step reasoning across models"
  has model from input.model_name

  accepts
    model_name as text, is required
    task_description as text, is required
    difficulty as text, defaults to "medium"

  responds with
    answer as text
    reasoning_steps as number
    confidence as text

  achieves evaluate
    step decompose
      ask "Break this task into logical steps"
      with task from input.task_description

    step reason
      ask "Execute each step and produce a final answer"
      with context from steps.decompose

    step assess
      compute result from steps.reason
      output answer as steps.reason.answer
      output reasoning_steps as steps.decompose.step_count
      output confidence as steps.reason.confidence

Run it with model_name: "anthropic:claude-sonnet". Run it again with model_name: "openai:gpt-4o". Run it again with model_name: "google:gemini-pro". Same machine. Same governance. Same trace format.

The behavioral ledger now contains, for each run: the exact prompts sent to each model, the exact responses received, the token counts (input, output, reasoning), the latency per step, and the cost. No custom logging. No parsing scripts. The data is already structured and queryable:

SELECT s.model, s.name, avg(s.duration_ms) as avg_ms,
       avg(s.input_tokens) as avg_in, avg(s.output_tokens) as avg_out,
       avg(s.reasoning_tokens) as avg_reason
FROM steps s JOIN runs r ON s.run_id = r.id
WHERE s.step_type = 'reason'
GROUP BY s.model, s.name;
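To make the query concrete, the Python sketch below runs it against a toy SQLite database. The table layout is an assumption inferred from the column names in the query, not mashin's actual schema, and the rows are made-up values. An ORDER BY is added only so the output order is deterministic.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema, inferred from the columns the query references.
db.executescript("""
CREATE TABLE runs (id INTEGER PRIMARY KEY);
CREATE TABLE steps (
    run_id INTEGER REFERENCES runs(id),
    model TEXT, name TEXT, step_type TEXT,
    duration_ms REAL, input_tokens INTEGER,
    output_tokens INTEGER, reasoning_tokens INTEGER
);
""")
db.executemany("INSERT INTO runs (id) VALUES (?)", [(1,), (2,), (3,)])
db.executemany(
    "INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "model-a", "reason", "reason", 100.0, 50, 20, 10),
        (2, "model-a", "reason", "reason", 300.0, 70, 40, 30),
        (3, "model-b", "reason", "reason", 200.0, 60, 30, 0),
        (3, "model-b", "decompose", "ask", 999.0, 1, 1, 1),  # filtered out
    ],
)

QUERY = """
SELECT s.model, s.name, avg(s.duration_ms) AS avg_ms,
       avg(s.input_tokens) AS avg_in, avg(s.output_tokens) AS avg_out,
       avg(s.reasoning_tokens) AS avg_reason
FROM steps s JOIN runs r ON s.run_id = r.id
WHERE s.step_type = 'reason'
GROUP BY s.model, s.name
ORDER BY s.model;
"""
results = db.execute(QUERY).fetchall()
for row in results:
    print(row)
```

With this toy data the query returns one row per model, with per-model averages of latency and token counts for the matching steps.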

That is what a research instrument looks like. The experiment is the machine. The data is the ledger. The protocol is enforceable. The results are reproducible. Another researcher reads the machine definition, runs it, and gets a comparable trace.

Cross-lingual AI behavior

Here is a research question that is hard to study systematically today: does GPT-4 reason differently in Japanese than in English?

In mashin, the experiment is structurally controlled. The same machine, written in Japanese, compiles to identical bytecode. The prompts go to the model in Japanese. The responses come back in Japanese. The trace is in Japanese. The governance is identical. The only variable is the language.

To our knowledge, no other platform does this: other AI research tools require the experiment itself to be defined in English. mashin is, as far as we know, the first platform where cross-lingual AI behavior can be studied with controlled experiments rather than anecdotal observation.
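The control condition can be stated as a check: two experimental conditions may differ in language (and therefore in prompt text) and in nothing else. The Python sketch below expresses that check. Condition and differing_fields are hypothetical names, the model string is a placeholder, and the Japanese prompt is an illustrative translation rather than one taken from a mashin machine.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Condition:
    language: str
    prompt: str
    model: str
    temperature: float
    max_tokens: int


def differing_fields(a: Condition, b: Condition) -> set:
    """Names of the fields whose values differ between two conditions."""
    da, db = asdict(a), asdict(b)
    return {key for key in da if da[key] != db[key]}


english = Condition("en", "Break this task into logical steps",
                    "some-model", 0.7, 1000)
japanese = Condition("ja", "このタスクを論理的なステップに分解してください",
                     "some-model", 0.7, 1000)

# The prompt text necessarily changes with the language; nothing else may.
uncontrolled = differing_fields(english, japanese) - {"language", "prompt"}
if uncontrolled:
    raise ValueError(f"conditions differ in uncontrolled fields: {uncontrolled}")
```

An empty uncontrolled set means any behavioral difference between the two runs can be attributed to language rather than to a stray parameter.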

The ledger as research dataset

The behavioral ledger is not documentation. It is a structured database. Every run of every machine produces rows in the steps, runs, and policy_decisions tables. The schema is consistent across all experiments. This means:

Meta-analysis across experiments. Query the ledger across all your experiments. Which models are most cost-efficient for reasoning tasks? Which prompt structures produce the most consistent outputs? The data accumulates automatically.

Longitudinal tracking. Run the same benchmark monthly as models update. The ledger captures performance drift over time. No custom tracking infrastructure needed.

Sharing raw data. The ledger is portable. Share it alongside your paper. Reviewers can query it, verify claims, and conduct their own analyses on your experimental data.
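As a sketch of the longitudinal case, the Python snippet below groups benchmark runs by month and reports drift relative to the first month. The started_at and score columns are assumptions added for illustration, since the post does not specify how timestamps or benchmark scores are stored, and the data is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical columns: the real ledger schema may name these differently.
db.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, model TEXT, "
    "started_at TEXT, score REAL)"
)
db.executemany(
    "INSERT INTO runs (model, started_at, score) VALUES (?, ?, ?)",
    [
        ("model-a", "2025-01-10", 0.80),
        ("model-a", "2025-01-20", 0.90),
        ("model-a", "2025-02-10", 0.70),
        ("model-a", "2025-02-20", 0.80),
    ],
)

# Average score per calendar month for one model.
monthly = db.execute(
    """SELECT strftime('%Y-%m', started_at) AS month, avg(score) AS avg_score
       FROM runs WHERE model = ? GROUP BY month ORDER BY month""",
    ("model-a",),
).fetchall()

# Drift is each month's average minus the first month's average.
baseline = monthly[0][1]
drift = [(month, round(score - baseline, 3)) for month, score in monthly]
```

Because the schema is the same for every run, the same query works for any model or benchmark already in the ledger.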

Why this matters now

The AI research community is about to face three simultaneous pressures:

Scale. As models get cheaper and faster, experiments get bigger. A researcher who ran 100 evaluations last year will run 10,000 next year. Ad hoc scripts do not scale. Structured instruments do.

Reproducibility expectations. Conferences and journals are tightening reproducibility requirements. “We used GPT-4” is no longer sufficient. Reviewers want to know: which version, what prompt, what parameters, what intermediate results. The ledger answers all of these automatically.

Regulatory interest. The EU AI Act lists certain uses of AI in education and vocational training among its high-risk categories. AI systems used in academic research contexts may fall under audit requirements. The behavioral ledger produces an audit trail by construction.

OpenAI wants to build automated AI researchers. The question that matters is not whether AI can do research. It is whether anyone can verify what the AI did. The answer requires an instrument, not a smarter model.

mashin is the instrument. Not because we designed it for research. Because the properties that make AI governance structural (automatic recording, governed execution, reproducible traces, model agnosticism) are the same properties that make AI research rigorous.

The microscope was not built for any one experiment. It was built to make observation systematic. What you study with it is up to you.