Document Layout Analysis (DLA) has advanced significantly in structured domains like invoices, forms, and academic papers. However, the layout understanding of educational content – especially question papers – remains largely underexplored. These documents are inherently multi-modal and hierarchical, containing a mix of textual structures like instructions, questions, answer blocks, and descriptions, often organized in complex multi-column formats.
HiLEx (Hierarchical Layout Extraction) is the first large-scale benchmark dataset designed specifically for understanding the layout of question paper images. It comprises 1,965 exam pages collected from eight major exams (GMAT, GRE, SAT, JEE, UPSC, GATE, BANK, and UGC-NET). Each page image is manually annotated with six hierarchical classes: Question_Paper_Area, Question_Block, Answer_Block, Question_Answer_Block, Instruction, and Description. Annotations are provided in both YOLO and COCO formats and were verified by expert annotators, achieving a gold-standard inter-annotator agreement (Cohen’s κ > 0.90).
By enabling reliable layout extraction in educational contexts, HiLEx opens up opportunities for scalable document understanding, intelligent grading, and inclusive learning technologies. This aligns with broader goals of SDG 4 (Quality Education) and SDG 10 (Reduced Inequalities).
The HiLEx dataset is a curated benchmark for hierarchical layout analysis in educational documents, specifically designed for question papers that exhibit diverse and complex visual structures. It addresses a major gap in existing DLA benchmarks by targeting the education domain—an area rich in practical use cases like exam digitization, automated grading, and content retrieval.
Each document image is annotated with a six-class hierarchical layout schema: Question_Paper_Area, Question_Block, Answer_Block, Question_Answer_Block, Instruction, and Description.
This hierarchy captures the structured and nested nature of question papers across both single-column and multi-column layouts.
All annotations were performed manually by three domain experts, followed by multi-phase quality control. Final annotations achieved Cohen’s κ > 0.90, indicating very high inter-annotator agreement and reliability.
HiLEx includes exam content from both national and international assessments, supporting cross-cultural analysis. The dataset is publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) license. You can access the data and tools from our GitHub repository:
GitHub: HiLEx-DLA/HiLEx
The repository provides the annotated page images, the annotations in both YOLO and COCO formats, and the accompanying tools.
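Since annotations ship in both YOLO and COCO formats, the COCO split can be inspected directly with pycocotools. The sketch below is illustrative only; the annotation file path is an assumption and may differ from the repository's actual layout.

```python
from pycocotools.coco import COCO

# Hypothetical annotation path; adjust to match the repository layout.
coco = COCO("annotations/hilex_train.json")

# The six hierarchical layout categories
cat_names = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}
print(sorted(cat_names.values()))

# Inspect the annotations of the first page image
img_id = coco.getImgIds()[0]
img_info = coco.loadImgs(img_id)[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(img_info["file_name"], cat_names[ann["category_id"]], (x, y, w, h))
```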
To evaluate the effectiveness of the HiLEx dataset, we benchmarked a diverse set of object detection models spanning multiple architectural paradigms. Each model was trained to detect the six hierarchical layout categories within question paper images, enabling a thorough comparison of their capabilities on this task.
We grouped the evaluated models into four families:
- **One-Stage Detectors**: fast, single-pass detectors that directly predict bounding boxes and classes in one go. Examples: YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12.
- **Two-Stage Detectors**: region-proposal-based models that first find object regions, then classify them. Example: Detectron2 (Faster R-CNN backbone).
- **Transformer-Based Detectors**: encoder-decoder models with object-query mechanisms for global-context object detection. Examples: DETR, RT-DETR.
- **Vision-Language Models**: pre-trained multi-modal models that understand images and text jointly, allowing zero-shot layout detection. Examples: Florence-2, PaLI-Gemma2.
All models were trained under consistent settings (identical train/test splits, the same six layout classes, and the same evaluation protocol) to ensure a fair comparison.
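As an illustration of the training setup for the YOLO family, here is a minimal Ultralytics fine-tuning sketch. The dataset config name (hilex.yaml) and the hyperparameters shown are assumptions, not the exact configuration behind the reported results.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")      # start from a pretrained checkpoint
model.train(
    data="hilex.yaml",          # hypothetical dataset config: image paths + the 6 class names
    epochs=100,
    imgsz=1024,                 # exam pages are text-dense; a larger input size helps small blocks
    batch=8,
)
metrics = model.val()           # reports mAP@50 and mAP@50-95 on the validation split
print(metrics.box.map50, metrics.box.map)
```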
During inference, each page image is processed by a trained model to detect hierarchical layout blocks (questions, answers, instructions, etc.). For YOLO-based detectors, we used the Ultralytics YOLO pipelines; for transformer models like DETR, we leveraged the HuggingFace and MMDetection frameworks. Model outputs were evaluated with IoU-based metrics, reported as both per-class and overall averages.
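A minimal inference sketch for a fine-tuned YOLO detector using the Ultralytics pipeline mentioned above; the checkpoint and image paths are placeholders.

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # hypothetical fine-tuned weights
results = model.predict("sample_page.png", conf=0.25)

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]        # one of the six layout classes
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{cls_name}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f}), conf={float(box.conf):.2f}")
```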
A typical evaluation pipeline ran the trained detector over every test page, matched its predictions to the ground-truth boxes by IoU, and computed per-class and averaged mAP@50 and mAP@50–95 scores.
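The mAP computation step can be reproduced with off-the-shelf tooling such as torchmetrics; this is an illustration of the metric, not necessarily the exact evaluation code used. The toy tensors stand in for real predictions and ground truth.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# Default IoU thresholds cover 0.50:0.95; class_metrics=True also yields per-class AP.
metric = MeanAveragePrecision(class_metrics=True)

preds = [{
    "boxes": torch.tensor([[50.0, 80.0, 600.0, 400.0]]),   # toy predicted box (x1, y1, x2, y2)
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([1]),                            # toy class index
}]
targets = [{
    "boxes": torch.tensor([[48.0, 75.0, 605.0, 410.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
results = metric.compute()
print(results["map_50"], results["map"])   # mAP@50 and mAP@50-95
```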
The results of this benchmarking are summarized below and on the Leaderboard, allowing comparison across model types and layout categories.
We evaluated HiLEx across a diverse set of models representing four major detection paradigms: one-stage, two-stage, transformer, and vision-language. All models were trained and tested on the same dataset split and evaluation criteria to ensure comparability.
All models were evaluated on six layout classes with standard IoU thresholds (0.5 for mAP@50, and 0.5–0.95 for mAP@50–95). Detectron2, a two-stage model, attained the highest mAP@50 overall, while YOLO variants offered strong performance with lighter models; YOLOv11 posted the best mAP@50–95. Vision-language models performed surprisingly well in zero-shot mode, though slightly behind the fine-tuned models.
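For reference, the IoU criterion behind these thresholds can be written as a plain function (standard formula, independent of any HiLEx-specific code): a prediction counts as a true positive at mAP@50 when its IoU with a same-class ground-truth box is at least 0.5.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```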
Question_Paper_Area and Question_Answer_Block were consistently detected with high precision across models, likely due to their large size and distinctive structure. In contrast, Instruction and Description were the most challenging classes – many models struggled to detect these smaller, variably formatted elements, leading to lower recall for those classes. Notably, YOLOv11 excelled at localizing fine-grained Answer_Block regions, while PaLI-Gemma2 demonstrated strong generalization to unseen layouts despite not being fine-tuned.
Common errors observed included over-detection in densely packed answer sections (multiple boxes around the same answer block) and missed instructions when their font style or placement differed from training examples. Vision-language models occasionally confused descriptions for question text blocks. These insights point to areas for future improvement, such as specialized handling of instruction text and better differentiation between similar text elements.
HiLEx features richly annotated pages with diverse layouts – from dense multi-column competitive exams to simpler single-question formats. Below we present key visual examples illustrating the dataset and model performance.
Each question paper image is annotated with color-coded boxes for the six structural classes, providing a hierarchical segmentation of the page content.
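A sketch of how such a color-coded overlay can be rendered from the YOLO-format labels; the file names, class ordering, and colors below are assumptions for illustration.

```python
from PIL import Image, ImageDraw

CLASSES = ["Question_Paper_Area", "Question_Block", "Answer_Block",
           "Question_Answer_Block", "Instruction", "Description"]   # assumed class order
COLORS = ["red", "blue", "green", "orange", "purple", "brown"]

img = Image.open("sample_page.png").convert("RGB")
draw = ImageDraw.Draw(img)
W, H = img.size

with open("sample_page.txt") as f:            # YOLO label file: class cx cy w h (normalized)
    for line in f:
        cls, cx, cy, w, h = line.split()
        cls, cx, cy, w, h = int(cls), float(cx), float(cy), float(w), float(h)
        x1, y1 = (cx - w / 2) * W, (cy - h / 2) * H
        x2, y2 = (cx + w / 2) * W, (cy + h / 2) * H
        draw.rectangle([x1, y1, x2, y2], outline=COLORS[cls], width=3)
        draw.text((x1, max(0, y1 - 12)), CLASSES[cls], fill=COLORS[cls])

img.save("sample_page_annotated.png")
```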
Single-Column vs Multi-Column: HiLEx includes both single-column papers and multi-column layouts. The annotations adapt to each style, capturing blocks appropriately in each format.
Below are sample outputs from fine-tuned detection models on HiLEx pages. Detected layout components are drawn with class-specific colors.
We visualize the detection performance across models to highlight accuracy vs. complexity trade-offs.
Certain layout elements remained difficult. Two common failure modes are shown below: (i) missed instructions due to unusual formatting or placement, and (ii) merged detections where question and answer text were not clearly separated.
These visualizations help pinpoint where models perform well and where improvements are needed, guiding future research on hierarchical layout understanding.
The table below presents the benchmarking results of various models on the HiLEx dataset. Models are grouped by architecture type. We report performance using mAP@50 and mAP@50–95 (mean average precision at IoU 0.5 and 0.5:0.95, respectively), which are standard object detection metrics.
All models were trained on the same training set and evaluated on an identical test set of HiLEx (covering all six layout classes). This ensures an apples-to-apples comparison of model capabilities.
| Model | Type | Training | Params | mAP@50 | mAP@50–95 |
|---|---|---|---|---|---|
| Detectron2 | Two-Stage | Fine-tuned | ~41M | 94.0% | 67.3% |
| YOLOv8 | One-Stage | Fine-tuned | ~35M | 84.8% | 66.1% |
| YOLOv11 | One-Stage | Fine-tuned | ~50M | 82.2% | 68.9% |
| RT-DETR | Transformer | Fine-tuned | ~50M | 60.6% | 44.8% |
| DETR | Transformer | Fine-tuned | ~86M | 50.2% | 35.0% |
| PaLI-Gemma2 | Vision-Language | Zero-shot | >1B | 83.5% | 59.0% |
| Florence-2 | Vision-Language | Zero-shot | >1B | 81.0% | 59.0% |
Note: All metrics are averaged across the six layout classes. Vision-Language models (Florence-2, PaLI-Gemma2) were evaluated in zero-shot mode (no fine-tuning on HiLEx).
If you evaluate a new model on HiLEx and wish to have it listed on the leaderboard, submit your results by updating leaderboard.json (or the relevant file) in the GitHub repository. We encourage the community to test novel models on HiLEx – let’s drive forward progress in document layout understanding!
HiLEx: Image-based Hierarchical Layout Extraction from Question Papers
Presented at the International Conference on Document Analysis and Recognition (ICDAR) 2025
Authors: [Names omitted for brevity]
📄 Preprint on arXiv | 📘 PDF Download | 💻 GitHub Repo
We introduce HiLEx, a new large-scale dataset and benchmark for hierarchical layout analysis in question paper images. HiLEx includes 1,965 exam pages annotated with six layout elements (questions, answers, instructions, etc.), covering content from eight diverse exams. We benchmark a range of models — including YOLOv8–12, Detectron2, RT-DETR, Florence-2, and PaLI-Gemma2 — and report state-of-the-art results using a two-stage fine-tuned detector. Our findings highlight the challenges of detecting instructions and descriptions, and the promise of vision-language models for future multi-modal layout tasks. This work aims to advance automated document understanding in education, supporting inclusive learning technologies aligned with SDG 4 and SDG 10.
For full details and results, please refer to the paper and code linked above.
If you use the HiLEx dataset, benchmarks, or tools in your research, please cite our paper:
@InProceedings{10.1007/978-3-032-04627-7_28,
author = {Aich, Utathya and Chakraborty, Shinjini and Sadhukhan, Deepan and Ghosh, Swarnendu and Saha, Tulika},
editor = {Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel},
title = {HiLEx: Image-Based Hierarchical Layout Extraction from Question Papers},
booktitle = {Document Analysis and Recognition -- ICDAR 2025},
year = {2026},
publisher = {Springer Nature Switzerland},
address = {Cham},
pages = {485--505},
abstract = {Education is the cornerstone of societal progress, yet automated document layout understanding in the education domain remains significantly under-explored, with most research focusing on individual components like texts, tables, images instead of a holistic understanding. Despite the increasing demand for AI-driven assessment, digitization, and retrieval of educational resources, very few dedicated works exists for question paper layout analysis, a critical component of automated learning systems. To bridge this gap, we introduce the first dataset HiLEx, explicitly designed for structure analysis and layout extraction from question paper images. The HiLEx dataset is curated from eight diverse examination formats. With over 1900 annotations with different structural layouts, covering both single-column and multi-column layouts, ensure robust generalization across different structural variations. We conduct a thorough empirical study with most contemporary object detection models, exposing their limitations in structural understanding, and format generalization. Our findings lay the groundwork for Smart AI solutions in education, fostering automated grading, question retrieval, and equitable learning access. This research aligns with UN Sustainable Development Goals (SDG 4: Quality Education, SDG 10: Reduced Inequalities) by enabling scalable, AI-driven assessment technologies, promoting inclusivity, and revolutionizing educational accessibility worldwide. The HiLEx dataset is publicly available in Github (https://github.com/HiLEx-DLA/HiLEx).},
isbn = {978-3-032-04627-7}
}
The HiLEx project is a collaborative effort by researchers working at the intersection of document analysis, computer vision, and educational AI. We welcome feedback and collaboration inquiries.
Email: your.email@institution.edu
GitHub Issues: HiLEx-DLA/HiLEx
Whether you are building educational AI tools, benchmarking layout models, or exploring document analysis in new domains – we’d love to hear from you!