CausalBench-Enterprise: Evaluating Risk-Aware Causal Reasoning in Large Language Models

Authors

  • Youla Yang, Indiana University

DOI:

https://doi.org/10.65563/jeaai.v1i8.71

Keywords:

large language models; causal reasoning; benchmark; enterprise risk; uncertainty calibration; evaluation

Abstract

Large language models (LLMs) are increasingly deployed in enterprise decision pipelines, yet their causal reasoning reliability and risk awareness remain poorly understood. Existing evaluations often test surface-level correlations or static benchmarks, overlooking uncertainty calibration and business impact. We introduce CausalBench-Enterprise, a unified benchmark and evaluation framework for assessing causal correctness and enterprise risk awareness across modern LLMs. Our system standardizes structured scenario prompts and computes two complementary metrics: accuracy and the proposed Enterprise Risk Score (ERS), which penalizes confident misjudgments under realistic business weights. We benchmark seven frontier models from OpenAI, Anthropic, Mistral, Meta, Google, and Alibaba under identical conditions using a unified OpenRouter API runner. Results reveal that GPT-4o-mini and Claude-3.5-Sonnet achieve the strongest causal reliability and calibration, while open-weight models (Llama-3.1, Mistral-7B) approach parity in accuracy but exhibit higher overconfidence. ERS exposes subtle yet critical gaps between correctness and risk sensitivity, suggesting that future LLM deployment in enterprise reasoning must jointly optimize for both accuracy and calibrated confidence.
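The two headline metrics in the abstract, accuracy and the Enterprise Risk Score (ERS), can be made concrete with a short sketch. The snippet below is an illustrative assumption only: the abstract does not state the ERS formula, so the weighted, confidence-scaled penalty used here, and the ScenarioResult fields, are hypothetical choices meant to show the intended behavior, namely that confident misjudgments on high-impact scenarios cost the most.

    # Hypothetical sketch of accuracy and an ERS-style metric.
    # Assumption: ERS = 1 minus the business-weighted, confidence-scaled error rate.
    from dataclasses import dataclass
    from typing import Sequence


    @dataclass
    class ScenarioResult:
        correct: bool           # did the causal judgment match the gold label?
        confidence: float       # model's self-reported confidence in [0, 1]
        business_weight: float  # relative business impact of the scenario


    def accuracy(results: Sequence[ScenarioResult]) -> float:
        """Fraction of scenarios answered correctly."""
        return sum(r.correct for r in results) / len(results)


    def enterprise_risk_score(results: Sequence[ScenarioResult]) -> float:
        """Hypothetical ERS: confident mistakes on high-impact scenarios
        reduce the score the most; hedged mistakes are penalized more gently."""
        total_weight = sum(r.business_weight for r in results)
        penalty = sum(
            r.confidence * r.business_weight for r in results if not r.correct
        )
        return 1.0 - penalty / total_weight


    if __name__ == "__main__":
        demo = [
            ScenarioResult(correct=True, confidence=0.90, business_weight=2.0),
            ScenarioResult(correct=False, confidence=0.95, business_weight=3.0),  # confident miss
            ScenarioResult(correct=False, confidence=0.40, business_weight=1.0),  # hedged miss
        ]
        print(f"accuracy: {accuracy(demo):.2f}")               # 0.33
        print(f"ERS:      {enterprise_risk_score(demo):.2f}")  # 1 - (0.95*3 + 0.4*1)/6 = 0.46

Under this illustrative formulation, two models with identical accuracy can receive very different ERS values, which is the gap between correctness and risk sensitivity that the abstract highlights.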

Published

2025-11-30