Model Research Harness

Research, test, score, rank, and report on model/provider fitness with a provider-neutral nxusKit workflow.

Research model fitness with provider-neutral runs, Promptfoo import, deterministic policy checks, Bayesian confidence, and dry-run lifecycle recommendations.

Scenarios: basic-ticket-routing · promptfoo-import · software-dev

Community - the default path uses provider-neutral LLM calls, mock/local fixtures, CLIPS-style deterministic policy checks, Bayesian confidence scoring, Promptfoo import, external-runner adapters, and dry-run lifecycle recommendations. Optional configs can use nxusKit CLIPS and Bayesian engines when native SDK dependencies are installed, with fixture-safe fallbacks for the public quickstart.

Edition note: Runs in Community Edition. Future Pro profiles may add Solver portfolio optimization and ZEN decision tables.

Optional Pro profile - future solver-backed portfolio selection and ZEN decision-table policies require a Pro or trial entitlement. The public quickstart below does not execute Pro code.

Difficulty: Advanced ♦🏁 · LLM · CLIPS · BN

  • Summary: Python-first model research and compatibility harness.
  • Scenario: Import or define evaluation configs, run provider/model test matrices, score outputs, apply policy, aggregate confidence, and write reports.
  • tech_tags in manifest: LLM, CLIPS, BN - example id model-research-harness in conformance/examples_manifest.json.
  • SDK: Use an installed SDK tree (NXUSKIT_SDK_DIR, NXUSKIT_LIB_PATH as needed); test-examples.sh resolves Go/Rust/Python deps from that tree only - see README.md, scripts/setup-sdk.sh, and scripts/test-examples.sh.
  • Languages in this example: Python is authoritative. CLI/Bash is a thin wrapper around the Python runner.
  • Python: standard library only for the public mock quickstart. The bundled .yaml configs use a strict JSON-compatible YAML subset; PyYAML is optional for broader user-authored YAML.
  • Native CLIPS/BN: use a Python interpreter with cffi installed and an SDK with python/src plus lib/libnxuskit.dylib. The harness automatically adds $NXUSKIT_SDK_DIR/python/src when NXUSKIT_SDK_DIR is set; on macOS, avoid Apple/Xcode Python for native-engine runs unless cffi is installed there. Set NXUSKIT_PYTHON=/path/to/python3 for the Bash wrapper when needed.
| Application | How this example applies |
| --- | --- |
| Model evaluation | Score model candidates against task-specific outputs and report confidence instead of relying on ad hoc impressions |
| Provider comparison | Compare local and cloud providers through one provider-neutral workflow while keeping capability claims honest |
| Lifecycle policy | Generate dry-run pull, pin, keep, or retest recommendations bounded by deterministic policy |
| Software development workflow research | Exercise code analysis, bug finding, bugfixing, generation, refactoring, and review scenarios with public-safe fixtures |

Attach an installed SDK (NXUSKIT_SDK_DIR). See the repository README.md and scripts/test-examples.sh.

# From `/examples/integrations/model-research-harness` (subshells keep the working directory unchanged between commands):
(cd python && python3 main.py --help)
(cd bash && bash main.sh --help)

Mock mode uses checked-in fixtures. It does not require cloud credentials, Promptfoo, Ollama, or a Pro entitlement.

cd python
python3 main.py --config ../configs/nxuskit-harness-basic.yaml --mode mock --json
python3 main.py --config ../configs/nxuskit-harness-software-dev.yaml --mode mock --output-dir ../.tmp/software-dev

Thin CLI/Bash wrapper:

cd bash
bash main.sh --config ../configs/nxuskit-harness-basic.yaml --mode mock --json

Promptfoo import:

cd python
python3 main.py --import-promptfoo ../configs/promptfoo-basic.yaml --mode import-promptfoo --json
python3 main.py \
  --import-promptfoo ../configs/promptfoo-requires-code.yaml \
  --compatibility-report ../.tmp/promptfoo-requires-code-report.json \
  --json

The second command is intentionally fail-closed: rather than running the config, it writes a compatibility report noting that the config requires --allow-code or --promptfoo-native-reference.

| Config | Purpose |
| --- | --- |
| nxuskit-harness-basic.yaml | Minimal mock quickstart for ticket classification |
| nxuskit-harness-clips-policy.yaml | Deterministic CLIPS-style required-field and forbidden-value checks |
| nxuskit-harness-clips-engine.yaml | Real nxusKit ClipsSession policy execution with Python fallback when native CLIPS is unavailable |
| nxuskit-harness-bayesian-confidence.yaml | Posterior confidence from sparse weighted evidence |
| nxuskit-harness-bn-engine.yaml | Real nxusKit BN inference over model-fitness evidence with beta fallback when native BN is unavailable |
| nxuskit-harness-local-vs-cloud.yaml | Local fixture versus cloud-reference fixture comparison |
| nxuskit-harness-structured-output.yaml | Native JSON-mode claims versus harness-side schema validation |
| nxuskit-harness-lifecycle-policy.yaml | Dry-run cache and lifecycle recommendations |
| nxuskit-harness-lifecycle-mutation-fixture.yaml | Public-safe mutation gate fixture requiring both external and lifecycle approval flags |
| nxuskit-harness-matrix-template.yaml | Compact prompt/parameter matrix expansion syntax |
| nxuskit-harness-native-ollama-template.yaml | Native Ollama schema, image, thinking, and tool-call knobs behind harness scoring |
| nxuskit-harness-software-dev.yaml | Code analysis, bug finding, bugfixing, code generation, refactoring, and review scenarios |
| nxuskit-harness-external-command-fixture.yaml | Public-safe fixture contract for wrapping operational runners |
| nxuskit-harness-devops-ollama-parity.yaml | Opt-in local adapter shape for a private Ollama research harness checkout |
| promptfoo-basic.yaml | Promptfoo-compatible config that imports and runs |
| promptfoo-requires-code.yaml | Promptfoo config that requires explicit trust or native-reference mode |

Test blocks may include a matrix object to expand variants without duplicating config:

{
  "id": "format-{{ prompt_variant }}-think-{{ think }}",
  "matrix": {
    "prompt_variant": ["baseline", "strict"],
    "think": ["off", "low"]
  }
}

Each combination is merged into test.vars, so prompts, ids, and adapter placeholders can reference the generated values.

The harness can wrap existing operational research runners through explicit external-command tests. This is useful when a team already has domain-specific scripts for local Ollama inventory, structured extraction, image/document pipelines, cache policy, or row-level fixture scoring.

External commands are fail-closed and never run unless --allow-external is supplied:

cd python
python3 main.py \
  --config ../configs/nxuskit-harness-external-command-fixture.yaml \
  --allow-external \
  --json

The fixture config runs only checked-in deterministic fixture commands. The DevOps parity config is a template for private/local use and expects OLLAMA_MODEL_TESTING_ROOT to point at an existing ollama-model-testing checkout:

export OLLAMA_MODEL_TESTING_ROOT=/path/to/ollama-model-testing
cd python
python3 main.py \
  --config ../configs/nxuskit-harness-devops-ollama-parity.yaml \
  --allow-external \
  --only-test common-sense-carwash \
  --output-dir ../.tmp/devops-parity

Use --exclude-test to skip expensive tests from a larger config. Both --only-test and --exclude-test accept comma-separated ids and may be repeated.
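
One way to accept flags that are both repeatable and comma-separated is argparse's `append` action followed by a split step. The sketch below is an assumption about how such parsing could look, not the harness's own code. The helpers `split_ids` and `select_tests` are hypothetical, and the rule that an excluded id wins over an included one is a design choice made here for illustration.

```python
import argparse


def build_parser():
    """Parser with repeatable filter flags; each occurrence is collected."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--only-test", action="append", default=[])
    parser.add_argument("--exclude-test", action="append", default=[])
    return parser


def split_ids(values):
    """Flatten repeated, comma-separated flag values into a list of ids."""
    ids = []
    for value in values:
        ids.extend(part.strip() for part in value.split(",") if part.strip())
    return ids


def select_tests(test_ids, only, exclude):
    """Keep ids listed in `only` (if any), then drop ids in `exclude`.

    Assumption: exclusion takes precedence when an id appears in both.
    """
    only_set, exclude_set = set(only), set(exclude)
    return [
        t for t in test_ids
        if (not only_set or t in only_set) and t not in exclude_set
    ]


args = build_parser().parse_args(
    ["--only-test", "a,b", "--only-test", "c", "--exclude-test", "b"]
)
selected = select_tests(
    ["a", "b", "c", "d"], split_ids(args.only_test), split_ids(args.exclude_test)
)
print(selected)  # ['a', 'c']
```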

The public adapter normalizes DevOps-style report shapes for common-sense curation, prompted/native tool intent, direct structured extraction, two-stage OCR or VLM pipelines, safe-labs row-level scoring, and CSV/TSV comparison helpers. The DevOps parity template also includes non-mutating ollama-cache status, list, and plan-evict checks.

External lifecycle mutation such as pulling, removing, pinning, or evicting models requires a test with external_command.mutation: true and both flags:

python3 main.py \
  --config ../configs/nxuskit-harness-lifecycle-mutation-fixture.yaml \
  --allow-external \
  --allow-lifecycle-mutations

Public configs should keep mutation commands behind explicit customer policy bounds.
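
The two-flag gate can be pictured as a small decision function that defaults to blocking. This is a minimal sketch under stated assumptions: the function name `external_command_allowed` and the reason strings are hypothetical. Only the `external_command.mutation` key and the two flag names come from the text above.

```python
def external_command_allowed(test, allow_external=False,
                             allow_lifecycle_mutations=False):
    """Return (allowed, reason) for a test; block unless explicitly permitted."""
    command = test.get("external_command")
    if command is None:
        # Nothing external to run, so no gate applies.
        return True, "no external command"
    if not allow_external:
        return False, "blocked: --allow-external not supplied"
    if command.get("mutation") and not allow_lifecycle_mutations:
        return False, "blocked: --allow-lifecycle-mutations not supplied"
    return True, "allowed"


# Hypothetical test entry; the command itself is illustrative only.
mutating_test = {"id": "evict-demo", "external_command": {"mutation": True}}

print(external_command_allowed(mutating_test))
print(external_command_allowed(mutating_test, allow_external=True))
print(external_command_allowed(
    mutating_test, allow_external=True, allow_lifecycle_mutations=True
))
```

Both flags default to `False`, so a caller that forgets to pass them gets the blocked outcome rather than an accidental mutation.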

Each run can write:

  • result.json
  • summary.md
  • Promptfoo compatibility report when importing Promptfoo configs
  • Scenario-level pass/fail matrix
  • Provider/model recommendation table
  • Capability truth table

The capability truth table separates native provider guarantees from harness-side validation and repair. For v1.0.0, Groq remains provider id groq with GROQ_API_KEY; xAI Grok uses provider id xai with XAI_API_KEY.

Live mode uses nxusKit provider factories for ollama, lmstudio, openai, claude, groq, and xai when credentials or local services are configured:

cd python
python3 main.py --config ../configs/nxuskit-harness-basic.yaml --mode live --provider ollama --model llama3.2

Strict schema support, native tool calling, and thinking controls vary by provider and model. The harness reports those differences instead of treating every backend as equivalent.

For Ollama features that are not yet normalized across every provider, set native_ollama: true on the provider or test. That path uses Ollama’s local /api/chat API directly and supports schema/JSON format, think, tools, image files, options, and num_predict. If an Ollama build rejects the think field, the harness retries once without it and records the observed metadata.
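
The retry-once behavior can be sketched as follows. This is an illustration, not the harness's code: the function names, the metadata keys, and the choice to treat an HTTP 400 as "the build rejected think" are all assumptions. The `send` parameter is injectable so the logic can be exercised without a running Ollama server.

```python
import json
import urllib.error
import urllib.request


def post_chat(payload, base_url="http://localhost:11434"):
    """POST a payload to Ollama's /api/chat and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def chat_with_think_fallback(payload, send=post_chat):
    """Send once; if `think` appears to be rejected, retry once without it.

    Assumption: a rejected `think` field surfaces as HTTP 400. Any other
    error, or a failure of the retry itself, propagates to the caller.
    """
    metadata = {"think_requested": "think" in payload, "think_retry": False}
    try:
        return send(payload), metadata
    except urllib.error.HTTPError as error:
        if "think" not in payload or error.code != 400:
            raise
        retry_payload = {k: v for k, v in payload.items() if k != "think"}
        metadata["think_retry"] = True
        metadata["think_rejected_status"] = error.code
        return send(retry_payload), metadata
```

A caller might pass a payload such as `{"model": "llama3.2", "messages": [...], "stream": False, "think": True}` and then store the returned metadata alongside the result, so a report can show whether thinking was actually in effect for that run.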

Portable Promptfoo configs import directly. Prompt/provider matrices expand to harness tests. Configs with executable or provider-native behavior fail closed unless the caller acknowledges the trust boundary:

python3 main.py --import-promptfoo ../configs/promptfoo-requires-code.yaml --allow-code --json
python3 main.py --import-promptfoo ../configs/promptfoo-requires-code.yaml --promptfoo-native-reference --json

With --allow-code, JavaScript assertions are executed through node when available. Unsupported assertions still fail closed instead of silently disappearing from the score.
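
To make "fail closed instead of silently disappearing" concrete, the sketch below scores a list of Promptfoo-style assertions so that anything it cannot evaluate counts as a failure with a recorded reason. The helper names `check` and `score`, the small set of handled types, and the reason strings are assumptions for illustration. The sketch does not execute JavaScript at all, so code assertions always count against the score here, even when `allow_code` is set.

```python
# Assertion types that would need code execution (illustrative subset).
CODE_TYPES = {"javascript", "python"}


def check(assertion, output, allow_code=False):
    """Return (passed, reason) for one assertion; never silently skip."""
    kind, value = assertion.get("type"), assertion.get("value")
    if kind == "equals":
        return output == value, "evaluated"
    if kind == "contains":
        return str(value) in output, "evaluated"
    if kind == "icontains":
        return str(value).lower() in output.lower(), "evaluated"
    if kind in CODE_TYPES and not allow_code:
        return False, "blocked: requires --allow-code"
    # Unknown types, and code types this sketch cannot run, count as failures.
    return False, f"unsupported assertion type: {kind}"


def score(assertions, output, allow_code=False):
    """Score = passed / total, where unevaluated assertions stay in the total."""
    results = [check(a, output, allow_code) for a in assertions]
    passed = sum(1 for ok, _ in results if ok)
    return {
        # An empty assertion list scores 0.0 rather than a vacuous 1.0.
        "score": passed / len(results) if results else 0.0,
        "reasons": [reason for _, reason in results],
    }


report = score(
    [
        {"type": "contains", "value": "billing"},
        {"type": "javascript", "value": "output.length > 0"},
        {"type": "made-up-type"},
    ],
    "Route to billing",
)
print(report)
```

Keeping blocked and unsupported assertions in the denominator means a config cannot look healthier simply because part of it was skipped.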

The default configs are fixture-safe and run with stdlib Python. Engine configs demonstrate how the same harness can call nxusKit-native reasoning providers:

  • policy.engine: "nxuskit-clips" loads a CLIPS rules file through ClipsSession, asserts the model output as a fact, and converts emitted finding facts into policy dispositions. on_engine_unavailable: "fallback-python" keeps public smoke tests runnable when native CLIPS dependencies are not installed.
  • bayesian.engine: "nxuskit-bn" loads a BIF model through nxusKit BN, maps test scores into evidence, and queries a configured posterior. on_engine_unavailable: "fallback-beta" keeps reports useful on machines without native BN dependencies.
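
As a rough picture of what a beta-style fallback could compute, the sketch below folds weighted test scores into a Beta posterior and reports its mean and variance. The harness's actual fallback formula is not documented here, so treat the update rule, the uniform Beta(1, 1) prior, and the function name `beta_confidence` as assumptions.

```python
def beta_confidence(evidence, prior_alpha=1.0, prior_beta=1.0):
    """Fold (score, weight) pairs into a Beta posterior.

    Each score in [0, 1] adds weight * score to alpha (support for fitness)
    and weight * (1 - score) to beta (support against). Assumed update rule.
    """
    alpha, beta = prior_alpha, prior_beta
    for score, weight in evidence:
        score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
        alpha += weight * score
        beta += weight * (1.0 - score)
    total = alpha + beta
    return {
        "alpha": alpha,
        "beta": beta,
        "mean": alpha / total,
        "variance": (alpha * beta) / (total ** 2 * (total + 1.0)),
    }


# Hypothetical evidence: two passes (one weighted double) and one failure.
posterior = beta_confidence([(1.0, 2.0), (0.0, 1.0), (1.0, 1.0)])
print(posterior["alpha"], posterior["beta"], round(posterior["mean"], 4))
# 4.0 2.0 0.6667
```

With sparse evidence the prior keeps the mean away from 0 or 1, and the variance gives a report a way to flag candidates that need more tests before a recommendation is trusted.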

Native-engine smoke example:

export NXUSKIT_SDK_DIR="${NXUSKIT_SDK_DIR:-$HOME/.nxuskit/sdk/current}"
export NXUSKIT_LIB_PATH="$NXUSKIT_SDK_DIR/lib/libnxuskit.dylib"
# Pick a Python with cffi installed. On this development Mac, Homebrew Python works.
/opt/homebrew/bin/python3 -c "import cffi"
/opt/homebrew/bin/python3 python/main.py \
  --config configs/nxuskit-harness-clips-engine.yaml \
  --json
/opt/homebrew/bin/python3 python/main.py \
  --config configs/nxuskit-harness-bn-engine.yaml \
  --json

  • Python is the authoritative implementation. Bash remains a thin wrapper for automation-friendly entry points.
  • Promptfoo import covers common portable config shapes, prompt/provider matrices, explicit trust gates for code/native behavior, and JavaScript assertion execution under --allow-code.
  • CLIPS and Bayesian examples include both deterministic/fallback checks and opt-in nxusKit-native engine execution.
  • Operational parity is provided through explicit external-command adapters, not private baked-in assumptions. Public releases should keep private rankings, fixture paths, and cache policy defaults out of bundled configs.
  • Lifecycle mutation remains blocked unless both --allow-external and --allow-lifecycle-mutations are supplied, and customer auto-approval should be bounded in config.