The SDK provides APIs to save eval results to MCPJam for visualization in the CI Evals dashboard. Results can be saved automatically via EvalTest/EvalSuite, or manually using the APIs below.
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `MCPJAM_API_KEY` | Yes | - | Your MCPJam workspace API key |
| `MCPJAM_BASE_URL` | No | `https://sdk.mcpjam.com` | MCPJam API base URL override |
Use MCPJAM_BASE_URL only when you need to override the default ingest host, such as internal development against a non-production backend.
MCPJAM_API_KEY controls whether results are uploaded. Replay credential capture only happens when you provide serverReplayConfigs, agent, or mcpClientManager. MCP App widget snapshots in each result’s trace (for iframe replay in MCPJam) come from PromptResult.getWidgetSnapshots(), which is only populated when TestAgent was constructed with mcpClientManager.
Uploaded iterations have a single boolean `passed` that drives the pass rate on the dashboard. The SDK distinguishes:
- Structural pass — your test returned `true`, or expected tool calls were satisfied (Inspector/UI flows use shared matching logic).
- Tool execution — the trace shows a real tool failure: MCP results with `isError: true`, timeline spans where a tool step ended in error, tool-result parts the UI would treat as errors, or a runner-level `iterationError` after a thrown tool step.
Default behavior: When the SDK derives passed from a trace (for example EvalTest / EvalSuite auto-save, PromptResult.toEvalResult(), createEvalRunReporter helpers, or Inspector suite runs), structural success is not enough if tool execution failed — the iteration is recorded as failed unless you opt out.
Opt out: set failOnToolError: false on MCPJamReportingConfig (global for that reporter or auto-save), or pass it on specific helper options (addFromPrompt / recordFromRun / RunToEvalResultsOptions, etc.). Use this when you only care that the model invoked the right tools with the right arguments, not that every MCP call returned a success payload.
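For example, a reporter that keeps structurally passing iterations even when an MCP call returned `isError: true` — a minimal sketch, assuming a suite where only tool selection and arguments matter:

```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";

// Only verify that the right tools were invoked with the right arguments;
// do not downgrade iterations whose MCP calls returned error payloads.
const reporter = createEvalRunReporter({
  suiteName: "Tool Selection Only",
  failOnToolError: false,
});
```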
Manual reportEvalResults: If you build results[] yourself and set passed explicitly, MCPJam stores your values as-is. The execution gate applies to code paths that compute passed from prompts, traces, and iterations.
Programmatic reuse: @mcpjam/sdk also exports finalizePassedForEval, traceIndicatesToolExecutionFailure, isCallToolResultError, and traceMessagePartIndicatesToolFailure for custom ingestion pipelines.
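The decision rule those exports implement can be mirrored locally. The sketch below is illustrative only — `finalizePassed` and `IterationLike` are hypothetical names, not SDK exports — but it captures the gating described above: structural success is downgraded to failure when the trace shows a tool execution error, unless `failOnToolError` is `false`:

```typescript
// Hypothetical local mirror of the SDK's pass-finalization rule.
type IterationLike = {
  structuralPass: boolean;      // did the test itself succeed?
  toolExecutionFailed: boolean; // e.g. an MCP result with isError: true
};

function finalizePassed(
  iteration: IterationLike,
  opts: { failOnToolError?: boolean } = {},
): boolean {
  const gate = opts.failOnToolError !== false; // strict by default
  if (!iteration.structuralPass) return false;
  return gate ? !iteration.toolExecutionFailed : true;
}
```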
reportEvalResults()
One-shot reporting. Sends all results in a single call. Throws on failure.
```typescript
import { MCPClientManager, reportEvalResults } from "@mcpjam/sdk";
```
Signature
```typescript
reportEvalResults(input: ReportEvalResultsInput): Promise<ReportEvalResultsOutput>
```
Example
```typescript
const manager = new MCPClientManager({
  asana: {
    url: process.env.MCP_SERVER_URL!,
    refreshToken: process.env.MCP_REFRESH_TOKEN!,
    clientId: process.env.MCP_CLIENT_ID!,
    clientSecret: process.env.MCP_CLIENT_SECRET,
  },
});

await manager.connectToServer("asana");

const output = await reportEvalResults({
  suiteName: "Nightly",
  mcpClientManager: manager,
  results: [
    { caseTitle: "healthcheck", passed: true },
    { caseTitle: "tool-selection", passed: true, durationMs: 1200 },
    { caseTitle: "edge-case", passed: false, error: "Wrong tool called" },
  ],
  passCriteria: { minimumPassRate: 90 },
  ci: {
    branch: "main",
    commitSha: "abc123",
  },
});

console.log(`Run ${output.runId}: ${output.result}`);
// "Run abc123: passed"
console.log(`${output.summary.passed}/${output.summary.total} passed`);
```
reportEvalResultsSafely()
Same as reportEvalResults(), but returns null instead of throwing on failure. Warnings are logged to the console.
```typescript
import { MCPClientManager, reportEvalResultsSafely } from "@mcpjam/sdk";
```
Signature
```typescript
reportEvalResultsSafely(input: ReportEvalResultsInput): Promise<ReportEvalResultsOutput | null>
```
Example
```typescript
const manager = new MCPClientManager({
  asana: {
    url: process.env.MCP_SERVER_URL!,
    refreshToken: process.env.MCP_REFRESH_TOKEN!,
    clientId: process.env.MCP_CLIENT_ID!,
  },
});

await manager.connectToServer("asana");

const output = await reportEvalResultsSafely({
  suiteName: "Nightly",
  mcpClientManager: manager,
  results: [{ caseTitle: "healthcheck", passed: true }],
});

if (output) {
  console.log(`Reported: ${output.summary.passRate * 100}% pass rate`);
} else {
  console.log("Reporting failed (non-blocking)");
}
```
Use reportEvalResultsSafely() when you don’t want eval reporting failures to break your CI pipeline. Use reportEvalResults() (strict) when reporting is critical.
createEvalRunReporter()
Creates an incremental reporter for long-running processes. Results are buffered and flushed in batches (up to 200 results or 1MB per batch).
```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";
```
Signature
```typescript
createEvalRunReporter(input: CreateEvalRunReporterInput): EvalRunReporter
```
CreateEvalRunReporterInput accepts the same replay source fields as reportEvalResults(): serverReplayConfigs, agent, and mcpClientManager.
EvalRunReporter Methods
| Method | Description |
|---|---|
| `add(result)` | Buffer a result (no network call) |
| `record(result)` | Buffer a result and auto-flush when the buffer is large |
| `flush()` | Upload all buffered results |
| `finalize()` | Flush remaining results and finalize the run |
| `getBufferedCount()` | Number of results in the buffer |
| `getAddedCount()` | Total results added (including flushed) |
| `setExpectedIterations(count)` | Set the expected iteration count for progress tracking |
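Putting the buffer-level methods together — a minimal sketch, assuming top-level `await`, a valid `MCPJAM_API_KEY` in the environment, and that `add()` buffers without returning a promise (per the table above, it makes no network call):

```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";

const reporter = createEvalRunReporter({ suiteName: "Batch Demo" });
reporter.setExpectedIterations(3); // progress tracking on the dashboard

reporter.add({ caseTitle: "a", passed: true }); // buffered only, no upload
reporter.add({ caseTitle: "b", passed: true });
console.log(reporter.getBufferedCount());       // results waiting in the buffer

await reporter.flush();    // upload the buffered results now
reporter.add({ caseTitle: "c", passed: false, error: "timeout" });
await reporter.finalize(); // flush the straggler and close the run
```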
PromptResult Helpers
| Method | Description |
|---|---|
| `addFromPrompt(promptResult, overrides?)` | Convert a PromptResult and buffer it |
| `recordFromPrompt(promptResult, overrides?)` | Convert a PromptResult, buffer it, and auto-flush |
EvalTest/EvalSuite Run Helpers
| Method | Description |
|---|---|
| `addFromRun(run, options)` | Convert all iterations from an EvalTest run |
| `recordFromRun(run, options)` | Convert and auto-flush from an EvalTest run |
| `addFromSuiteRun(suiteRun, options)` | Convert all iterations from an EvalSuite run |
| `recordFromSuiteRun(suiteRun, options)` | Convert and auto-flush from an EvalSuite run |
Example
```typescript
// Assumes `agent` was created with `mcpClientManager: manager`
const reporter = createEvalRunReporter({
  suiteName: "Integration Tests",
  passCriteria: { minimumPassRate: 85 },
  agent,
  ci: {
    branch: process.env.GITHUB_REF_NAME,
    commitSha: process.env.GITHUB_SHA,
  },
});
```
Example with manual replay source resolution
```typescript
const reporter = createEvalRunReporter({
  suiteName: "Integration Tests",
  passCriteria: { minimumPassRate: 85 },
  mcpClientManager: manager,
});

// Add results as tests complete
await reporter.record({ caseTitle: "test-1", passed: true, durationMs: 500 });
await reporter.record({ caseTitle: "test-2", passed: false, error: "timeout" });
await reporter.record({ caseTitle: "test-3", passed: true });

// Finalize the run
const output = await reporter.finalize();
console.log(`${output.summary.passed}/${output.summary.total} passed`);
```
Replay credential sources
Authenticated HTTP evals can securely persist replay credentials for reruns and debugging. Manual reporting APIs resolve replay configs in this order:
1. `serverReplayConfigs`
2. `agent.getServerReplayConfigs()`
3. `mcpClientManager.getServerReplayConfigs()`
Prefer passing agent or mcpClientManager directly. Use serverReplayConfigs only when you need a low-level override.
If you also provide serverNames, inferred replay configs are filtered to those server IDs before upload. Explicit serverReplayConfigs are left unchanged.
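For instance — a sketch assuming `manager` has registrations for more servers than this run exercised; the filter applies only to the configs inferred from `agent`/`mcpClientManager`:

```typescript
import { reportEvalResults } from "@mcpjam/sdk";

const output = await reportEvalResults({
  suiteName: "Nightly",
  mcpClientManager: manager,
  // Inferred replay configs are filtered to these server IDs before upload.
  serverNames: ["asana"],
  results: [{ caseTitle: "healthcheck", passed: true }],
});
```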
Uploaded runs can show Replay this run / server-side MCP replay when the ingest payload includes derived `serverReplayConfigs` (stored as `hasServerReplayConfig` on the run). In practice:
- HTTP MCP (`url`) — replay configs are built for typical streamable HTTP connections. Stdio transports do not produce entries from `MCPClientManager.getServerReplayConfigs()`; use HTTP when you need dashboard replay.
- TestAgent vs reporter — putting `mcpClientManager` on `TestAgent` fills MCP App widget snapshots on `PromptResult`. The reporter (and one-shot `report*`) resolves server replay from its own `agent` / `mcpClientManager` fields. Pass `agent` or `mcpClientManager` into `createEvalRunReporter` as well; agent-only wiring can still upload traces and widgets but omit `hasServerReplayConfig`.
- Teardown order — `finalize()` / one-shot reporting calls `getServerReplayConfigs()` against connected registrations. In `afterAll`, run `await reporter.finalize()` (or `reportEvalResults`) before `await manager.disconnectAllServers()`. Disconnecting first clears manager state and uploads without replay metadata.
LLM API keys are not stored on the run. Replaying in the MCPJam UI still requires provider keys (e.g. OpenRouter) in Settings for your suite’s models.
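In a Jest or Vitest `afterAll` hook, the teardown ordering above looks like this — a sketch assuming `reporter` and `manager` were created during setup:

```typescript
afterAll(async () => {
  // Finalize first, while getServerReplayConfigs() can still see
  // connected server registrations.
  await reporter.finalize();
  // Only disconnect afterwards; disconnecting first would upload
  // the run without replay metadata.
  await manager.disconnectAllServers();
});
```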
Using with PromptResult
```typescript
// Pass `agent` or `mcpClientManager` when you need server replay metadata in MCPJam
const reporter = createEvalRunReporter({ suiteName: "Prompt Tests", agent });

const result = await agent.prompt("Add 2 and 3");

reporter.addFromPrompt(result, {
  caseTitle: "addition",
  passed: result.hasToolCall("add"),
});

const output = await reporter.finalize();
```
Using with EvalTest Runs
```typescript
import { EvalTest } from "@mcpjam/sdk";

const reporter = createEvalRunReporter({ suiteName: "Full Suite" });

const test = new EvalTest({
  name: "addition",
  test: async (agent) => (await agent.prompt("Add 2+3")).hasToolCall("add"),
});

const run = await test.run(agent, { iterations: 10 });
await reporter.recordFromRun(run, { casePrefix: "addition" });

const output = await reporter.finalize();
```
uploadEvalArtifact()
Parses test artifacts (JUnit XML, Jest JSON, Vitest JSON) and reports the results to MCPJam.
```typescript
import { uploadEvalArtifact } from "@mcpjam/sdk";
```
Signature
```typescript
uploadEvalArtifact(input: UploadEvalArtifactInput): Promise<ReportEvalResultsOutput>
```
| Format | Description |
|---|---|
| `"junit-xml"` | JUnit XML test reports |
| `"jest-json"` | Jest JSON output (`--json` flag) |
| `"vitest-json"` | Vitest JSON reporter output |
| `"custom"` | Custom parser via the `customParser` option |
Example
```typescript
import { readFileSync } from "fs";

// Upload JUnit XML
await uploadEvalArtifact({
  suiteName: "CI Results",
  format: "junit-xml",
  artifact: readFileSync("test-results.xml", "utf-8"),
});

// Upload Jest JSON
await uploadEvalArtifact({
  suiteName: "Jest Results",
  format: "jest-json",
  artifact: readFileSync("jest-results.json", "utf-8"),
});

// Custom parser
await uploadEvalArtifact({
  suiteName: "Custom",
  format: "custom",
  artifact: myData,
  customParser: (data) => [
    { caseTitle: "test-1", passed: true },
    { caseTitle: "test-2", passed: false, error: "failed" },
  ],
});
```
Types
```typescript
type ReportEvalResultsInput = MCPJamReportingConfig & {
  suiteName: string;
  results: EvalResultInput[];
  agent?: {
    getServerReplayConfigs?: () => MCPServerReplayConfig[] | undefined;
  };
  mcpClientManager?: MCPClientManager;
};
```
MCPJamReportingConfig
| Property | Type | Required | Description |
|---|---|---|---|
| `enabled` | `boolean` | No | Enable/disable reporting (default: `true`) |
| `apiKey` | `string` | No | MCPJam API key (falls back to the `MCPJAM_API_KEY` env var) |
| `baseUrl` | `string` | No | MCPJam API base URL override (useful for internal development or tests) |
| `suiteName` | `string` | No | Suite name for the run |
| `suiteDescription` | `string` | No | Description of the suite |
| `serverNames` | `string[]` | No | MCP server names being tested |
| `serverReplayConfigs` | `MCPServerReplayConfig[]` | No | Advanced override for replay credential capture |
| `notes` | `string` | No | Free-form notes |
| `passCriteria` | `{ minimumPassRate: number }` | No | Pass threshold (0-100) |
| `failOnToolError` | `boolean` | No | When not `false`, results derived from traces treat tool execution failures as failed iterations (default: strict). See Tool execution and passed |
| `strict` | `boolean` | No | Throw on upload errors (`false` = warn only) |
| `externalRunId` | `string` | No | Custom run ID (auto-generated if omitted) |
| `framework` | `string` | No | Test framework name (e.g., `"jest"`, `"vitest"`) |
| `ci` | `EvalCiMetadata` | No | CI/CD pipeline context |
| `expectedIterations` | `number` | No | Expected total iterations for progress tracking |
| `agent` | `{ getServerReplayConfigs?: () => MCPServerReplayConfig[] \| undefined }` | No | Preferred replay source for manual reporting; use an agent created with `mcpClientManager` so results can include `widgetSnapshots` from `PromptResult.toEvalResult()` |
| `mcpClientManager` | `MCPClientManager` | No | Replay source when no `agent` is provided; does not populate `widgetSnapshots` unless your `results[].widgetSnapshots` or trace payloads already include them |
MCPServerReplayConfig
Advanced replay override. Most users should not construct this manually.
| Property | Type | Required | Description |
|---|---|---|---|
| `serverId` | `string` | Yes | MCP server identifier |
| `url` | `string` | Yes | MCP server URL |
| `preferSSE` | `boolean` | No | Prefer SSE transport for replay |
| `accessToken` | `string` | No | Static bearer token for replay |
| `refreshToken` | `string` | No | Refresh token for replay |
| `clientId` | `string` | No | OAuth client ID; required with `refreshToken` |
| `clientSecret` | `string` | No | OAuth client secret when needed for token refresh |
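If you do need the low-level override, passing one explicitly looks roughly like this — a sketch using the field names from the table above, with placeholder environment variable names:

```typescript
import { reportEvalResults } from "@mcpjam/sdk";

await reportEvalResults({
  suiteName: "Nightly",
  serverReplayConfigs: [
    {
      serverId: "asana",
      url: process.env.MCP_SERVER_URL!,
      refreshToken: process.env.MCP_REFRESH_TOKEN!,
      clientId: process.env.MCP_CLIENT_ID!, // required alongside refreshToken
    },
  ],
  results: [{ caseTitle: "healthcheck", passed: true }],
});
```

Explicit configs like this are uploaded as-is, without the `serverNames` filtering applied to inferred configs.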
EvalCiMetadata
| Property | Type | Description |
|---|---|---|
| `provider` | `string` | CI provider (e.g., `"github"`, `"gitlab"`) |
| `pipelineId` | `string` | Pipeline/workflow identifier |
| `jobId` | `string` | Job identifier |
| `runUrl` | `string` | URL to the CI run |
| `branch` | `string` | Git branch name |
| `commitSha` | `string` | Git commit SHA |
EvalResultInput
| Property | Type | Required | Description |
|---|---|---|---|
| `caseTitle` | `string` | Yes | Test case title |
| `passed` | `boolean` | Yes | Whether the test passed |
| `query` | `string` | No | The prompt/query sent |
| `durationMs` | `number` | No | Test duration in ms |
| `provider` | `string` | No | LLM provider name |
| `model` | `string` | No | Model identifier |
| `expectedToolCalls` | `EvalExpectedToolCall[]` | No | Expected tool calls |
| `actualToolCalls` | `EvalExpectedToolCall[]` | No | Actual tool calls made |
| `tokens` | `{ input?, output?, total? }` | No | Token usage |
| `error` | `string` | No | Error message |
| `errorDetails` | `string` | No | Detailed error info |
| `trace` | `EvalTraceInput` | No | Conversation trace |
| `externalIterationId` | `string` | No | Custom iteration ID |
| `externalCaseId` | `string` | No | Custom case ID |
| `metadata` | `Record<string, string \| number \| boolean>` | No | Custom metadata |
| `isNegativeTest` | `boolean` | No | Whether this is a negative test |
| `widgetSnapshots` | `EvalWidgetSnapshotInput[]` | No | MCP App HTML replay payloads (typically from `getWidgetSnapshots()` on `PromptResult`). Omitted when not using MCP Apps or when `TestAgent` had no `mcpClientManager` |
ReportEvalResultsOutput
| Property | Type | Description |
|---|---|---|
| `suiteId` | `string` | Created/matched suite ID |
| `runId` | `string` | Created run ID |
| `status` | `"completed" \| "failed"` | Run status |
| `result` | `"passed" \| "failed"` | Pass/fail based on criteria |
| `summary.total` | `number` | Total iterations |
| `summary.passed` | `number` | Passed iterations |
| `summary.failed` | `number` | Failed iterations |
| `summary.passRate` | `number` | Pass rate (0.0 - 1.0) |