HousingQA (Knowledge)

This task evaluates model knowledge of housing law, with a focus on eviction, as of 2021. Models are prompted with yes/no questions about housing law in different states and are expected to answer using only the knowledge stored in their weights. To learn more about HousingQA, see here.

Rank  Model                      accuracy  f1_macro  Date        Results
1     gpt-5-2025-08-07           0.715     0.705     2025-08-08  View
2     claude-3-haiku-20240307    0.593     0.588     2025-08-04  View
3     claude-3-5-haiku-20241022  0.584     0.580     2025-08-04  View
4     gpt-4o-mini-2024-07-18     0.544     0.544     2025-08-04  View
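
The accuracy and f1_macro columns can be illustrated with a small sketch. The benchmark's own scoring code is not shown here, so this assumes the standard definitions: accuracy is the fraction of questions answered correctly, and macro F1 is the unweighted mean of the per-class F1 scores. The "Yes"/"No" label strings, the function names, and the toy answers below are hypothetical, chosen only for illustration.

```python
# Assumed label set for a yes/no task (hypothetical; the benchmark may
# normalize answers differently).
LABELS = ("Yes", "No")


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_macro(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (standard macro averaging)."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, not taken from the benchmark.
gold = ["Yes", "Yes", "No", "No"]
pred = ["Yes", "No", "No", "No"]

acc = accuracy(gold, pred)
macro = f1_macro(gold, pred)
print(f"accuracy={acc:.3f} f1_macro={macro:.3f}")
```

Macro averaging weights the "Yes" and "No" classes equally, so a model that always gives the majority answer scores lower on f1_macro than on accuracy when the classes are imbalanced.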

Tasks in This Benchmark