Benchmark

LLM instruction-following evaluation service

The service provides comprehensive evaluation and customized solutions for LLMs, offering model performance assessment, data strategy consultation, tailored dataset creation, and cost-effective iteration management.
Schedule a Call
LLM Benchmark Dataset Features
Model Target
Professional Scene Coverage
High Data Quality Standards
Evaluation-Algorithm Collaboration Iteration
Benchmark with custom datasets
A data-strategy-centered service that maintains quality throughout the planning and production phases.
1. Data Production Strategy
Standardize data content based on model objectives, ensuring both diversity and accuracy.
2. Incentive Mechanism
Implement an incentive mechanism for the data team, aligning tasks with specialized expertise to cover a wide range of data scenarios.
3. Data Strategy Training
Collaborate with the data team to refine production processes and deepen data understanding.
4. Data Analytics Report
Provide comprehensive statistical analysis of data distribution and corner case summaries to optimize the production process and uphold the highest data quality standards.
CIF-Bench
The world's first LLM instruction-following evaluation service.
Evaluation system
20 Dimensions
150 Tasks
15,000 Evaluation Data Points
4 Metrics Framework
Automated Evaluation Ranking
An evaluation of 28 selected LLMs revealed significant performance gaps, with overall scores ranging from 0.244 (Cpm-Bee-10B) to 0.529 (Baichuan2-13B-Chat), highlighting the limits of LLM generalization in Chinese task environments.
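Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation
Baichuan2-13B-Chat | 0.529 | 0.520 | 0.674 | 0.333 | 0.641 | 0.497 | 0.686 | 0.542 | 0.528 | 0.578 | 0.563 | 0.632 | 0.569 | 0.515 | 0.752 | 0.624 | 0.459 | 0.462 | 0.332 | 0.441 | 0.273
Qwen-72B-Chat | 0.519 | 0.486 | 0.630 | 0.296 | 0.634 | 0.508 | 0.634 | 0.458 | 0.520 | 0.494 | 0.550 | 0.626 | 0.565 | 0.528 | 0.762 | 0.613 | 0.496 | 0.459 | 0.282 | 0.608 | 0.271
Yi-34B-Chat | 0.512 | 0.483 | 0.606 | 0.347 | 0.623 | 0.497 | 0.598 | 0.480 | 0.490 | 0.575 | 0.525 | 0.619 | 0.554 | 0.494 | 0.757 | 0.580 | 0.472 | 0.439 | 0.346 | 0.514 | 0.259
Qwen-14B-Chat | 0.500 | 0.481 | 0.582 | 0.307 | 0.614 | 0.494 | 0.645 | 0.428 | 0.475 | 0.496 | 0.513 | 0.616 | 0.548 | 0.507 | 0.764 | 0.583 | 0.469 | 0.453 | 0.283 | 0.575 | 0.262
Deepseek-Llm-67B-Chat | 0.471 | 0.467 | 0.571 | 0.259 | 0.577 | 0.486 | 0.549 | 0.442 | 0.476 | 0.475 | 0.509 | 0.566 | 0.496 | 0.439 | 0.711 | 0.546 | 0.409 | 0.436 | 0.262 | 0.570 | 0.235
Baichuan-13B-Chat | 0.450 | 0.408 | 0.491 | 0.286 | 0.552 | 0.439 | 0.670 | 0.417 | 0.422 | 0.482 | 0.486 | 0.565 | 0.505 | 0.377 | 0.704 | 0.552 | 0.387 | 0.402 | 0.350 | 0.431 | 0.304
Chatglm3-6B | 0.436 | 0.381 | 0.439 | 0.330 | 0.541 | 0.452 | 0.577 | 0.310 | 0.358 | 0.436 | 0.453 | 0.544 | 0.503 | 0.414 | 0.762 | 0.560 | 0.446 | 0.402 | 0.321 | 0.391 | 0.270
Yi-6B-Chat | 0.417 | 0.402 | 0.454 | 0.313 | 0.523 | 0.425 | 0.506 | 0.383 | 0.383 | 0.487 | 0.396 | 0.523 | 0.457 | 0.369 | 0.754 | 0.482 | 0.401 | 0.380 | 0.310 | 0.455 | 0.227
Baichuan2-7B-Chat | 0.412 | 0.437 | 0.647 | 0.160 | 0.520 | 0.402 | 0.580 | 0.511 | 0.444 | 0.455 | 0.407 | 0.489 | 0.395 | 0.406 | 0.670 | 0.517 | 0.342 | 0.298 | 0.101 | 0.463 | 0.138
Chatglm2-6B | 0.352 | 0.278 | 0.469 | 0.346 | 0.403 | 0.424 | 0.535 | 0.274 | 0.397 | 0.406 | 0.240 | 0.397 | 0.352 | 0.326 | 0.714 | 0.438 | 0.298 | 0.313 | 0.320 | 0.461 | 0.190
Chatglm-6B-Sft | 0.349 | 0.265 | 0.454 | 0.365 | 0.385 | 0.462 | 0.554 | 0.296 | 0.379 | 0.427 | 0.232 | 0.380 | 0.321 | 0.292 | 0.718 | 0.415 | 0.296 | 0.333 | 0.351 | 0.441 | 0.190
Chinese-Llama2-Linly-13B | 0.344 | 0.250 | 0.462 | 0.311 | 0.399 | 0.429 | 0.557 | 0.273 | 0.358 | 0.385 | 0.268 | 0.390 | 0.330 | 0.313 | 0.653 | 0.433 | 0.279 | 0.332 | 0.292 | 0.457 | 0.181
Gpt-3.5-Turbo-Sft | 0.343 | 0.269 | 0.427 | 0.298 | 0.389 | 0.395 | 0.575 | 0.325 | 0.365 | 0.389 | 0.226 | 0.382 | 0.394 | 0.345 | 0.710 | 0.433 | 0.324 | 0.266 | 0.290 | 0.397 | 0.225
Chinese-Alpaca-2-13B | 0.341 | 0.242 | 0.421 | 0.356 | 0.382 | 0.442 | 0.602 | 0.256 | 0.363 | 0.430 | 0.210 | 0.376 | 0.334 | 0.317 | 0.714 | 0.459 | 0.299 | 0.316 | 0.308 | 0.452 | 0.200
Chinese-Alpaca-13B | 0.334 | 0.250 | 0.399 | 0.348 | 0.364 | 0.435 | 0.616 | 0.275 | 0.349 | 0.421 | 0.223 | 0.370 | 0.309 | 0.319 | 0.724 | 0.426 | 0.285 | 0.307 | 0.298 | 0.445 | 0.181
Chinese-Alpaca-7B | 0.334 | 0.216 | 0.412 | 0.378 | 0.381 | 0.425 | 0.576 | 0.265 | 0.359 | 0.393 | 0.243 | 0.383 | 0.326 | 0.295 | 0.710 | 0.409 | 0.301 | 0.327 | 0.325 | 0.405 | 0.186
Chinese-Llama2-Linly-7B | 0.333 | 0.218 | 0.451 | 0.330 | 0.396 | 0.427 | 0.583 | 0.248 | 0.350 | 0.410 | 0.231 | 0.367 | 0.345 | 0.276 | 0.698 | 0.433 | 0.259 | 0.315 | 0.310 | 0.469 | 0.168
Tigerbot-13B-Chat | 0.331 | 0.205 | 0.397 | 0.309 | 0.385 | 0.420 | 0.614 | 0.310 | 0.379 | 0.341 | 0.276 | 0.363 | 0.329 | 0.301 | 0.694 | 0.419 | 0.280 | 0.310 | 0.283 | 0.393 | 0.186
Telechat-7B | 0.329 | 0.267 | 0.338 | 0.321 | 0.420 | 0.404 | 0.420 | 0.272 | 0.265 | 0.327 | 0.320 | 0.388 | 0.355 | 0.244 | 0.672 | 0.344 | 0.334 | 0.335 | 0.299 | 0.364 | 0.184
Ziya-Llama-13B | 0.329 | 0.196 | 0.402 | 0.324 | 0.341 | 0.428 | 0.616 | 0.312 | 0.349 | 0.400 | 0.228 | 0.351 | 0.279 | 0.313 | 0.721 | 0.468 | 0.311 | 0.291 | 0.278 | 0.431 | 0.175
Chinese-Alpaca-33B | 0.326 | 0.234 | 0.370 | 0.372 | 0.364 | 0.429 | 0.614 | 0.246 | 0.318 | 0.377 | 0.221 | 0.368 | 0.300 | 0.314 | 0.713 | 0.428 | 0.288 | 0.303 | 0.295 | 0.401 | 0.199
Tigerbot-7B-Chat | 0.325 | 0.218 | 0.395 | 0.306 | 0.370 | 0.413 | 0.631 | 0.294 | 0.370 | 0.368 | 0.215 | 0.355 | 0.313 | 0.292 | 0.713 | 0.415 | 0.283 | 0.315 | 0.290 | 0.389 | 0.171
Chinese-Alpaca-2-7B | 0.323 | 0.215 | 0.374 | 0.335 | 0.366 | 0.415 | 0.546 | 0.257 | 0.326 | 0.395 | 0.215 | 0.375 | 0.318 | 0.289 | 0.698 | 0.417 | 0.285 | 0.303 | 0.312 | 0.439 | 0.193
Aquilachat-7B | 0.309 | 0.162 | 0.234 | 0.291 | 0.320 | 0.437 | 0.344 | 0.135 | 0.266 | 0.309 | 0.287 | 0.337 | 0.342 | 0.236 | 0.609 | 0.255 | 0.249 | 0.400 | 0.527 | 0.430 | 0.306
Moss-Moon-003-Sft | 0.302 | 0.214 | 0.405 | 0.274 | 0.347 | 0.380 | 0.448 | 0.305 | 0.341 | 0.378 | 0.232 | 0.317 | 0.321 | 0.267 | 0.694 | 0.375 | 0.251 | 0.259 | 0.288 | 0.424 | 0.152
Qwen-7B-Chat | 0.301 | 0.211 | 0.410 | 0.289 | 0.349 | 0.391 | 0.531 | 0.219 | 0.387 | 0.404 | 0.208 | 0.325 | 0.297 | 0.278 | 0.681 | 0.419 | 0.266 | 0.251 | 0.248 | 0.371 | 0.157
Belle-13B-Sft | 0.264 | 0.198 | 0.307 | 0.285 | 0.316 | 0.349 | 0.409 | 0.237 | 0.305 | 0.222 | 0.177 | 0.317 | 0.284 | 0.242 | 0.631 | 0.299 | 0.244 | 0.222 | 0.234 | 0.296 | 0.133
Cpm-Bee-10B | 0.244 | 0.234 | 0.377 | 0.024 | 0.278 | 0.311 | 0.255 | 0.302 | 0.278 | 0.327 | 0.148 | 0.286 | 0.224 | 0.147 | 0.603 | 0.277 | 0.117 | 0.263 | 0.220 | 0.352 | 0.125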
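To make the ranking and the performance gap concrete, here is a minimal Python sketch that sorts a handful of the Overall scores from the leaderboard above and computes the spread between the best and worst model. The helper names `rank_models` and `score_gap` are illustrative assumptions, not part of any CIF-Bench tooling, and the sketch says nothing about how the Overall score itself is derived.

```python
# Overall scores copied from five rows of the leaderboard above.
overall = {
    "Gpt-3.5-Turbo-Sft": 0.343,
    "Baichuan2-13B-Chat": 0.529,
    "Cpm-Bee-10B": 0.244,
    "Qwen-72B-Chat": 0.519,
    "Yi-34B-Chat": 0.512,
}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def score_gap(scores):
    """Difference between the best and worst score, rounded to 3 decimals."""
    return round(max(scores.values()) - min(scores.values()), 3)


ranking = rank_models(overall)
for position, (model, score) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {score:.3f}")
print("gap between best and worst:", score_gap(overall))
```

The same two helpers can be applied to any single dimension column, for example to compare models on Code or Translation only.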
CIF-Bench Evaluation Service
We provide highly customizable evaluation services that integrate with your API or model code.
Unlimited Access to CIF-Bench Public Dataset
With 45,000 data instances, the Chinese Instruction-Following Benchmark (CIF-Bench) supports the development of more adaptable, culturally aware, and linguistically diverse language models.
Book a Demo
Automated Evaluation Service
Provide an API or model code and quickly receive performance feedback on your model, produced by running inference over our comprehensive evaluation dataset.
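As a rough illustration of what an automated evaluation loop of this kind could look like, the Python sketch below runs a model callable over labeled examples and averages a metric per task. Everything in it is an assumption made for illustration: the record layout, the `evaluate` and `exact_match` helpers, and the `toy_model` stand-in are hypothetical and do not describe the service's actual interface or metrics.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (task name, prompt, reference answer).
Example = Tuple[str, str, str]


def exact_match(prediction: str, reference: str) -> float:
    """Placeholder metric: 1.0 if the stripped strings match, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def evaluate(
    model: Callable[[str], str],
    examples: Iterable[Example],
    metric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run the model on every prompt and average the metric per task."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for task, prompt, reference in examples:
        totals[task] += metric(model(prompt), reference)
        counts[task] += 1
    return {task: totals[task] / counts[task] for task in totals}


def toy_model(prompt: str) -> str:
    """Toy stand-in for a real model endpoint (assumption)."""
    return "positive" if "good" in prompt else "negative"


examples = [
    ("Sentiment", "The film was good.", "positive"),
    ("Sentiment", "The film was dull.", "negative"),
    ("Sentiment", "The service was fine.", "positive"),
]
report = evaluate(toy_model, examples)
print(report)
```

In practice `model` would wrap a call to the submitted API or a local inference function, and `metric` would be replaced by whichever scoring function suits the task.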
Manual Evaluation Service
Using the interfaces, models, and code you submit, our expert team runs detailed tests and delivers a comprehensive report.

Explore More

Fill out the form to schedule a personalized demo with our team. Experience firsthand how our innovative solutions can meet your needs and drive success.

Book a Demo

Copyright © 2025 StardustAI Inc. All rights reserved.