
[Bug] MBPP score significantly lower than official results #1855

Open · 2 tasks done · GenerallyCovetous opened this issue Feb 7, 2025 · 0 comments
Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

```python
{'CUDA available': True,
'GCC': 'gcc (GCC) 7.3.0',
'MMEngine': '0.10.6',
'MUSA available': False,
'OpenCV': '4.11.0',
'PyTorch': '2.1.0',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 10.2\n'
' - C++ Version: 201703\n'
' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
'64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: NO AVX\n'
' - Build settings: BLAS_INFO=open, '
'BUILD_TYPE=Release, '
'CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=old-style-cast '
'-Wno-invalid-partial-specialization '
'-Wno-unused-private-field '
'-Wno-aligned-allocation-unavailable '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=open, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.1.0, USE_CUDA=OFF, '
'USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.16 (main, Dec 11 2024, 16:18:56) [GCC 11.2.0]',
'TorchVision': '0.16.0',
'lmdeploy': "not installed:No module named 'lmdeploy'",
'numpy_random_seed': 2147483648,
'opencompass': '0.3.9+',
'sys.platform': 'linux',
'transformers': '4.48.0'}
```

Reproduces the problem - code/configuration sample

```bash
python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c --debug
```

Reproduces the problem - command or script

```bash
python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c --debug
```

Reproduces the problem - error message

When I tested the Llama-3.1-8B base model with the config from the official README, I got a score of only 43.58, while the README reports 54.86 for llama3-8b-turbomind. That is a surprisingly large gap. What is the reason for this difference in scores?
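For context on what this number measures: MBPP-style evaluation is essentially pass@1, the percentage of problems whose single generated program passes all of the dataset's assert-based tests. Below is a minimal illustrative sketch of that metric, not the actual OpenCompass evaluator (which additionally does answer extraction, sandboxing, and timeouts); the `completion` and `test_list` field names are assumptions for the example:

```python
# Minimal sketch of MBPP-style pass@1 scoring (illustrative only, not the
# OpenCompass implementation).
def passes(completion: str, test_list: list[str]) -> bool:
    env: dict = {}
    try:
        exec(completion, env)   # define the candidate function(s)
        for test in test_list:  # each entry is an `assert ...` statement
            exec(test, env)
        return True
    except Exception:
        return False

def pass_at_1(problems: list[dict]) -> float:
    solved = sum(passes(p["completion"], p["test_list"]) for p in problems)
    return 100.0 * solved / len(problems)

# Tiny usage example:
demo = [{"completion": "def add(a, b):\n    return a + b",
         "test_list": ["assert add(1, 2) == 3"]}]
print(pass_at_1(demo))  # 100.0
```

In these terms, 43.58 vs. 54.86 means roughly 11 fewer problems per hundred passing their tests, a gap that prompt format, stop tokens, or the inference backend could all plausibly contribute to.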

(Two screenshots of the evaluation results were attached here.)
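One variable worth isolating: the 54.86 reference is for llama3-8b-turbomind, i.e. Llama-3-8B served through lmdeploy/turbomind, while my run used Llama-3.1-8B through the HF backend (lmdeploy is not installed in the environment above). Assuming this OpenCompass version supports the `-a/--accelerator` switch (worth confirming with `python run.py --help`), a backend-matched comparison run might look like:

```bash
# Hypothetical comparison run: same dataset, lmdeploy/turbomind backend.
# The -a/--accelerator flag is an assumption here; verify it exists in
# your OpenCompass version before relying on it.
pip install lmdeploy
python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c -a lmdeploy --debug
```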

Other information

No response
