Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games | AIChainDay