CausalBench-Enterprise: Evaluating Risk-Aware Causal Reasoning in Large Language Models
DOI:
https://doi.org/10.65563/jeaai.v1i8.71

Abstract
Large language models (LLMs) are increasingly deployed in enterprise decision pipelines, yet their causal reasoning reliability and risk awareness remain poorly understood. Existing evaluations often probe surface-level correlations or rely on static benchmarks, overlooking uncertainty calibration and business impact. We introduce CausalBench-Enterprise, a unified benchmark and evaluation framework for assessing causal correctness and enterprise risk awareness across modern LLMs. Our system standardizes structured scenario prompts and computes two complementary metrics: accuracy and the proposed Enterprise Risk Score (ERS), which penalizes confident misjudgments under realistic business weights. We benchmark seven frontier models from OpenAI, Anthropic, Mistral, Meta, Google, and Alibaba under identical conditions using a unified OpenRouter API runner. Results reveal that GPT-4o-mini and Claude-3.5-Sonnet achieve the strongest causal reliability and calibration, while open-weight models (Llama-3.1, Mistral-7B) approach parity in accuracy but exhibit greater overconfidence. ERS exposes subtle yet critical gaps between correctness and risk sensitivity, suggesting that LLMs deployed for enterprise reasoning must be optimized jointly for accuracy and calibrated confidence.
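The abstract describes ERS only at a high level. As a minimal illustrative sketch (not the paper's actual formula), the Python snippet below shows one way a score of this kind could penalize confident misjudgments under business-impact weights; the record field names, weight values, and normalization scheme are all assumptions introduced here for illustration.

```python
# Illustrative ERS-style calculation. The exact formula, business weights,
# and field names used by CausalBench-Enterprise are not given in this page;
# everything below is a hypothetical sketch of the general idea.

def enterprise_risk_score(records, weights=None):
    """Aggregate a score that penalizes confident wrong answers.

    Each record is assumed to hold: 'correct' (bool), 'confidence' (0-1),
    and 'impact' (a business-impact label: 'low', 'medium', or 'high').
    """
    weights = weights or {"low": 1.0, "medium": 2.0, "high": 4.0}  # hypothetical weights
    penalty = 0.0
    for r in records:
        if not r["correct"]:
            # Confident errors on high-impact scenarios are penalized hardest.
            penalty += r["confidence"] * weights[r["impact"]]
    max_penalty = sum(weights[r["impact"]] for r in records) or 1.0
    return 1.0 - penalty / max_penalty  # higher score = lower enterprise risk


if __name__ == "__main__":
    demo = [
        {"correct": True,  "confidence": 0.9, "impact": "high"},
        {"correct": False, "confidence": 0.8, "impact": "high"},  # costly overconfident error
        {"correct": False, "confidence": 0.3, "impact": "low"},   # cheap, hedged error
    ]
    print(f"ERS (illustrative): {enterprise_risk_score(demo):.3f}")
```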
License
Copyright (c) 2025 Youla Yang

This work is licensed under a Creative Commons Attribution 4.0 International License.