The SDK provides APIs to save eval results to MCPJam for visualization in the CI Evals dashboard. Results can be saved automatically via EvalTest/EvalSuite, or manually using the APIs below.
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `MCPJAM_API_KEY` | Yes | - | Your MCPJam workspace API key |
| `MCPJAM_BASE_URL` | No | `https://sdk.mcpjam.com` | MCPJam API base URL override |
Use MCPJAM_BASE_URL only when you need to override the default ingest host, such as internal development against a non-production backend.
MCPJAM_API_KEY controls whether results are uploaded. Replay credential capture only happens when you provide serverReplayConfigs, agent, or mcpClientManager. MCP App widget snapshots in each result’s trace (for iframe replay in MCPJam) come from PromptResult.getWidgetSnapshots(), which is only populated when TestAgent was constructed with mcpClientManager.
Uploaded iterations have a single boolean `passed` that drives the pass rate on the dashboard. The SDK distinguishes:
- Structural pass — your test returned `true`, or expected tool calls were satisfied (Inspector/UI flows use shared matching logic).
- Tool execution — the trace shows a real tool failure: MCP results with `isError: true`, timeline spans where a tool step ended in error, tool-result parts the UI would treat as errors, or a runner-level `iterationError` after a thrown tool step.
Default behavior: When the SDK derives passed from a trace (for example EvalTest / EvalSuite auto-save, PromptResult.toEvalResult(), createEvalRunReporter helpers, or Inspector suite runs), structural success is not enough if tool execution failed — the iteration is recorded as failed unless you opt out.
Opt out: set failOnToolError: false on MCPJamReportingConfig (global for that reporter or auto-save), or pass it on specific helper options (addFromPrompt / recordFromRun / RunToEvalResultsOptions, etc.). Use this when you only care that the model invoked the right tools with the right arguments, not that every MCP call returned a success payload.
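For example, a reporter that keeps structurally passing iterations even when an MCP call returned `isError: true` — a minimal sketch, assuming a suite where only tool selection and arguments matter:

```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";

// Only verify that the right tools were invoked with the right arguments;
// do not downgrade iterations whose MCP calls returned error payloads.
const reporter = createEvalRunReporter({
  suiteName: "Tool Selection Only",
  failOnToolError: false,
});
```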
Manual reportEvalResults: If you build results[] yourself and set passed explicitly, MCPJam stores your values as-is. The execution gate applies to code paths that compute passed from prompts, traces, and iterations.
Programmatic reuse: @mcpjam/sdk also exports finalizePassedForEval, traceIndicatesToolExecutionFailure, isCallToolResultError, and traceMessagePartIndicatesToolFailure for custom ingestion pipelines.
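The decision rule those exports implement can be mirrored locally. The sketch below is illustrative only — `finalizePassed` and `IterationLike` are hypothetical names, not SDK exports — but it captures the gating described above: structural success is downgraded to failure when the trace shows a tool execution error, unless `failOnToolError` is `false`:

```typescript
// Hypothetical local mirror of the SDK's pass-finalization rule.
type IterationLike = {
  structuralPass: boolean;      // did the test itself succeed?
  toolExecutionFailed: boolean; // e.g. an MCP result with isError: true
};

function finalizePassed(
  iteration: IterationLike,
  opts: { failOnToolError?: boolean } = {},
): boolean {
  const gate = opts.failOnToolError !== false; // strict by default
  if (!iteration.structuralPass) return false;
  return gate ? !iteration.toolExecutionFailed : true;
}
```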
reportEvalResults()
One-shot reporting. Sends all results in a single call. Throws on failure.
```typescript
import { MCPClientManager, reportEvalResults } from "@mcpjam/sdk";
```
Signature
```typescript
reportEvalResults(input: ReportEvalResultsInput): Promise<ReportEvalResultsOutput>
```
Example
```typescript
const manager = new MCPClientManager({
  asana: {
    url: process.env.MCP_SERVER_URL!,
    refreshToken: process.env.MCP_REFRESH_TOKEN!,
    clientId: process.env.MCP_CLIENT_ID!,
    clientSecret: process.env.MCP_CLIENT_SECRET,
  },
});

await manager.connectToServer("asana");

const output = await reportEvalResults({
  suiteName: "Nightly",
  mcpClientManager: manager,
  results: [
    { caseTitle: "healthcheck", passed: true },
    { caseTitle: "tool-selection", passed: true, durationMs: 1200 },
    { caseTitle: "edge-case", passed: false, error: "Wrong tool called" },
  ],
  passCriteria: { minimumPassRate: 90 },
  ci: {
    branch: "main",
    commitSha: "abc123",
  },
});

console.log(`Run ${output.runId}: ${output.result}`);
// "Run abc123: passed"
console.log(`${output.summary.passed}/${output.summary.total} passed`);
```
reportEvalResultsSafely()
Same as reportEvalResults(), but returns null instead of throwing on failure. Warnings are logged to the console.
```typescript
import { MCPClientManager, reportEvalResultsSafely } from "@mcpjam/sdk";
```
Signature
```typescript
reportEvalResultsSafely(input: ReportEvalResultsInput): Promise<ReportEvalResultsOutput | null>
```
Example
```typescript
const manager = new MCPClientManager({
  asana: {
    url: process.env.MCP_SERVER_URL!,
    refreshToken: process.env.MCP_REFRESH_TOKEN!,
    clientId: process.env.MCP_CLIENT_ID!,
  },
});

await manager.connectToServer("asana");

const output = await reportEvalResultsSafely({
  suiteName: "Nightly",
  mcpClientManager: manager,
  results: [{ caseTitle: "healthcheck", passed: true }],
});

if (output) {
  console.log(`Reported: ${output.summary.passRate * 100}% pass rate`);
} else {
  console.log("Reporting failed (non-blocking)");
}
```
Use reportEvalResultsSafely() when you don’t want eval reporting failures to break your CI pipeline. Use reportEvalResults() (strict) when reporting is critical.
createEvalRunReporter()
Creates an incremental reporter for long-running processes. Results are buffered and flushed in batches (up to 200 results or 1MB per batch).
```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";
```
Signature
```typescript
createEvalRunReporter(input: CreateEvalRunReporterInput): EvalRunReporter
```
CreateEvalRunReporterInput accepts the same replay source fields as reportEvalResults(): serverReplayConfigs, agent, and mcpClientManager.
EvalRunReporter Methods
| Method | Description |
|---|---|
| `add(result)` | Buffer a result (no network call) |
| `record(result)` | Buffer a result and auto-flush when the buffer is large |
| `flush()` | Upload all buffered results |
| `finalize()` | Flush remaining results and finalize the run |
| `getBufferedCount()` | Number of results in the buffer |
| `getAddedCount()` | Total results added (including flushed) |
| `setExpectedIterations(count)` | Set the expected iteration count for progress tracking |
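Putting the buffer-level methods together — a minimal sketch, assuming top-level `await`, a valid `MCPJAM_API_KEY` in the environment, and that `add()` buffers without returning a promise (per the table above, it makes no network call):

```typescript
import { createEvalRunReporter } from "@mcpjam/sdk";

const reporter = createEvalRunReporter({ suiteName: "Batch Demo" });
reporter.setExpectedIterations(3); // progress tracking on the dashboard

reporter.add({ caseTitle: "a", passed: true }); // buffered only, no upload
reporter.add({ caseTitle: "b", passed: true });
console.log(reporter.getBufferedCount());       // results waiting in the buffer

await reporter.flush();    // upload the buffered results now
reporter.add({ caseTitle: "c", passed: false, error: "timeout" });
await reporter.finalize(); // flush the straggler and close the run
```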
PromptResult Helpers
| Method | Description |
|---|---|
| `addFromPrompt(promptResult, overrides?)` | Convert a PromptResult and buffer it |
| `recordFromPrompt(promptResult, overrides?)` | Convert a PromptResult, buffer it, and auto-flush |
EvalTest/EvalSuite Run Helpers
| Method | Description |
|---|---|
| `addFromRun(run, options)` | Convert all iterations from an EvalTest run |
| `recordFromRun(run, options)` | Convert and auto-flush from an EvalTest run |
| `addFromSuiteRun(suiteRun, options)` | Convert all iterations from an EvalSuite run |
| `recordFromSuiteRun(suiteRun, options)` | Convert and auto-flush from an EvalSuite run |
Example
```typescript
// Assumes `agent` was created with `mcpClientManager: manager`
const reporter = createEvalRunReporter({
  suiteName: "Integration Tests",
  passCriteria: { minimumPassRate: 85 },
  agent,
  ci: {
    branch: process.env.GITHUB_REF_NAME,
    commitSha: process.env.GITHUB_SHA,
  },
});
```
Example with manual replay source resolution
```typescript
const reporter = createEvalRunReporter({
  suiteName: "Integration Tests",
  passCriteria: { minimumPassRate: 85 },
  mcpClientManager: manager,
});

// Add results as tests complete
await reporter.record({ caseTitle: "test-1", passed: true, durationMs: 500 });
await reporter.record({ caseTitle: "test-2", passed: false, error: "timeout" });
await reporter.record({ caseTitle: "test-3", passed: true });

// Finalize the run
const output = await reporter.finalize();
console.log(`${output.summary.passed}/${output.summary.total} passed`);
```
Replay credential sources
Authenticated HTTP evals can securely persist replay credentials for reruns and debugging. Manual reporting APIs resolve replay configs in this order:
1. `serverReplayConfigs`
2. `agent.getServerReplayConfigs()`
3. `mcpClientManager.getServerReplayConfigs()`
Prefer passing agent or mcpClientManager directly. Use serverReplayConfigs only when you need a low-level override.
If you also provide serverNames, inferred replay configs are filtered to those server IDs before upload. Explicit serverReplayConfigs are left unchanged.
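For instance — a sketch assuming `manager` has registrations for more servers than this run exercised; the filter applies only to the configs inferred from `agent`/`mcpClientManager`:

```typescript
import { reportEvalResults } from "@mcpjam/sdk";

const output = await reportEvalResults({
  suiteName: "Nightly",
  mcpClientManager: manager,
  // Inferred replay configs are filtered to these server IDs before upload.
  serverNames: ["asana"],
  results: [{ caseTitle: "healthcheck", passed: true }],
});
```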
Uploaded runs can show Replay this run / server-side MCP replay when the ingest payload includes derived `serverReplayConfigs` (stored as `hasServerReplayConfig` on the run). In practice:
- HTTP MCP (`url`) — replay configs are built for typical streamable HTTP connections. Stdio transports do not produce entries from `MCPClientManager.getServerReplayConfigs()`; use HTTP when you need dashboard replay.
- TestAgent vs reporter — putting `mcpClientManager` on `TestAgent` fills MCP App widget snapshots on `PromptResult`. The reporter (and one-shot `report*`) resolves server replay from its own `agent` / `mcpClientManager` fields. Pass `agent` or `mcpClientManager` into `createEvalRunReporter` as well; agent-only wiring can still upload traces and widgets but omit `hasServerReplayConfig`.
- Teardown order — `finalize()` / one-shot reporting calls `getServerReplayConfigs()` against connected registrations. In `afterAll`, run `await reporter.finalize()` (or `reportEvalResults`) before `await manager.disconnectAllServers()`. Disconnecting first clears manager state and uploads without replay metadata.
LLM API keys are not stored on the run. Replaying in the MCPJam UI still requires provider keys (e.g. OpenRouter) in Settings for your suite’s models.
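In a Jest or Vitest `afterAll` hook, the teardown ordering above looks like this — a sketch assuming `reporter` and `manager` were created during setup:

```typescript
afterAll(async () => {
  // Finalize first, while getServerReplayConfigs() can still see
  // connected server registrations.
  await reporter.finalize();
  // Only disconnect afterwards; disconnecting first would upload
  // the run without replay metadata.
  await manager.disconnectAllServers();
});
```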
Using with PromptResult
```typescript
// Pass `agent` or `mcpClientManager` when you need server replay metadata in MCPJam
const reporter = createEvalRunReporter({ suiteName: "Prompt Tests", agent });

const result = await agent.prompt("Add 2 and 3");

reporter.addFromPrompt(result, {
  caseTitle: "addition",
  passed: result.hasToolCall("add"),
});

const output = await reporter.finalize();
```
Using with EvalTest Runs
```typescript
import { EvalTest } from "@mcpjam/sdk";

const reporter = createEvalRunReporter({ suiteName: "Full Suite" });

const test = new EvalTest({
  name: "addition",
  test: async (agent) => (await agent.prompt("Add 2+3")).hasToolCall("add"),
});

const run = await test.run(agent, { iterations: 10 });
await reporter.recordFromRun(run, { casePrefix: "addition" });

const output = await reporter.finalize();
```
uploadEvalArtifact()
Parses test artifacts (JUnit XML, Jest JSON, Vitest JSON) and reports the results to MCPJam.
```typescript
import { uploadEvalArtifact } from "@mcpjam/sdk";
```
Signature
```typescript
uploadEvalArtifact(input: UploadEvalArtifactInput): Promise<ReportEvalResultsOutput>
```
| Format | Description |
|---|---|
| `"junit-xml"` | JUnit XML test reports |
| `"jest-json"` | Jest JSON output (`--json` flag) |
| `"vitest-json"` | Vitest JSON reporter output |
| `"custom"` | Custom parser via the `customParser` option |
Example
```typescript
import { readFileSync } from "fs";

// Upload JUnit XML
await uploadEvalArtifact({
  suiteName: "CI Results",
  format: "junit-xml",
  artifact: readFileSync("test-results.xml", "utf-8"),
});

// Upload Jest JSON
await uploadEvalArtifact({
  suiteName: "Jest Results",
  format: "jest-json",
  artifact: readFileSync("jest-results.json", "utf-8"),
});

// Custom parser
await uploadEvalArtifact({
  suiteName: "Custom",
  format: "custom",
  artifact: myData,
  customParser: (data) => [
    { caseTitle: "test-1", passed: true },
    { caseTitle: "test-2", passed: false, error: "failed" },
  ],
});
```
Types
```typescript
type ReportEvalResultsInput = MCPJamReportingConfig & {
  suiteName: string;
  results: EvalResultInput[];
  agent?: {
    getServerReplayConfigs?: () => MCPServerReplayConfig[] | undefined;
  };
  mcpClientManager?: MCPClientManager;
};
```
MCPJamReportingConfig
| Property | Type | Required | Description |
|---|---|---|---|
| `enabled` | `boolean` | No | Enable/disable reporting (default: `true`) |
| `apiKey` | `string` | No | MCPJam API key (falls back to the `MCPJAM_API_KEY` env var) |
| `baseUrl` | `string` | No | MCPJam API base URL override (useful for internal development or tests) |
| `suiteName` | `string` | No | Suite name for the run |
| `suiteDescription` | `string` | No | Description of the suite |
| `serverNames` | `string[]` | No | MCP server names being tested |
| `serverReplayConfigs` | `MCPServerReplayConfig[]` | No | Advanced override for replay credential capture |
| `notes` | `string` | No | Free-form notes |
| `passCriteria` | `{ minimumPassRate: number }` | No | Pass threshold (0-100) |
| `failOnToolError` | `boolean` | No | When not `false`, results derived from traces treat tool execution failures as failed iterations (default: strict). See Tool execution and passed |
| `strict` | `boolean` | No | Throw on upload errors (`false` = warn only) |
| `externalRunId` | `string` | No | Custom run ID (auto-generated if omitted) |
| `framework` | `string` | No | Test framework name (e.g., `"jest"`, `"vitest"`) |
| `ci` | `EvalCiMetadata` | No | CI/CD pipeline context |
| `expectedIterations` | `number` | No | Expected total iterations for progress tracking |
| `agent` | `{ getServerReplayConfigs?: () => MCPServerReplayConfig[] \| undefined }` | No | Preferred replay source for manual reporting; use an agent created with `mcpClientManager` so results can include `widgetSnapshots` from `PromptResult.toEvalResult()` |
| `mcpClientManager` | `MCPClientManager` | No | Replay source when no `agent` is provided; does not populate `widgetSnapshots` unless your `results[].widgetSnapshots` or trace payloads already include them |
MCPServerReplayConfig
Advanced replay override. Most users should not construct this manually.
| Property | Type | Required | Description |
|---|---|---|---|
| `serverId` | `string` | Yes | MCP server identifier |
| `url` | `string` | Yes | MCP server URL |
| `preferSSE` | `boolean` | No | Prefer SSE transport for replay |
| `accessToken` | `string` | No | Static bearer token for replay |
| `refreshToken` | `string` | No | Refresh token for replay |
| `clientId` | `string` | No | OAuth client ID; required with `refreshToken` |
| `clientSecret` | `string` | No | OAuth client secret when needed for token refresh |
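If you do need the low-level override, passing one explicitly looks roughly like this — a sketch using the field names from the table above, with placeholder environment variable names:

```typescript
import { reportEvalResults } from "@mcpjam/sdk";

await reportEvalResults({
  suiteName: "Nightly",
  serverReplayConfigs: [
    {
      serverId: "asana",
      url: process.env.MCP_SERVER_URL!,
      refreshToken: process.env.MCP_REFRESH_TOKEN!,
      clientId: process.env.MCP_CLIENT_ID!, // required alongside refreshToken
    },
  ],
  results: [{ caseTitle: "healthcheck", passed: true }],
});
```

Explicit configs like this are uploaded as-is, without the `serverNames` filtering applied to inferred configs.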
EvalCiMetadata
| Property | Type | Description |
|---|---|---|
| `provider` | `string` | CI provider (e.g., `"github"`, `"gitlab"`) |
| `pipelineId` | `string` | Pipeline/workflow identifier |
| `jobId` | `string` | Job identifier |
| `runUrl` | `string` | URL to the CI run |
| `branch` | `string` | Git branch name |
| `commitSha` | `string` | Git commit SHA |
EvalResultInput
| Property | Type | Required | Description |
|---|---|---|---|
| `caseTitle` | `string` | Yes | Test case title |
| `passed` | `boolean` | Yes | Whether the test passed |
| `query` | `string` | No | The prompt/query sent |
| `durationMs` | `number` | No | Test duration in ms |
| `provider` | `string` | No | LLM provider name |
| `model` | `string` | No | Model identifier |
| `expectedToolCalls` | `EvalExpectedToolCall[]` | No | Expected tool calls |
| `actualToolCalls` | `EvalExpectedToolCall[]` | No | Actual tool calls made |
| `tokens` | `{ input?, output?, total? }` | No | Token usage |
| `error` | `string` | No | Error message |
| `errorDetails` | `string` | No | Detailed error info |
| `trace` | `EvalTraceInput` | No | Conversation trace |
| `externalIterationId` | `string` | No | Custom iteration ID |
| `externalCaseId` | `string` | No | Custom case ID |
| `metadata` | `Record<string, string \| number \| boolean>` | No | Custom metadata |
| `isNegativeTest` | `boolean` | No | Whether this is a negative test |
| `widgetSnapshots` | `EvalWidgetSnapshotInput[]` | No | MCP App HTML replay payloads (typically from `getWidgetSnapshots()` on `PromptResult`). Omitted when not using MCP Apps or when `TestAgent` had no `mcpClientManager` |
ReportEvalResultsOutput
| Property | Type | Description |
|---|---|---|
| `suiteId` | `string` | Created/matched suite ID |
| `runId` | `string` | Created run ID |
| `status` | `"completed" \| "failed"` | Run status |
| `result` | `"passed" \| "failed"` | Pass/fail based on criteria |
| `summary.total` | `number` | Total iterations |
| `summary.passed` | `number` | Passed iterations |
| `summary.failed` | `number` | Failed iterations |
| `summary.passRate` | `number` | Pass rate (0.0 - 1.0) |