HiLEx: Image-based Hierarchical Layout Extraction
from Question Papers


Utathya Aich, Shinjini Chakraborty, Deepan Sadhukhan, Tulika Saha, and Swarnendu Ghosh

About HiLEx

Document Layout Analysis (DLA) has advanced significantly in structured domains like invoices, forms, and academic papers. However, the layout understanding of educational content – especially question papers – remains largely underexplored. These documents are inherently multi-modal and hierarchical, containing a mix of textual structures like instructions, questions, answer blocks, and descriptions, often organized in complex multi-column formats.

HiLEx (Hierarchical Layout Extraction) is the first large-scale benchmark dataset designed specifically for understanding the layout of question paper images. It comprises 1,965 exam pages collected from eight major exams (GMAT, GRE, SAT, JEE, UPSC, GATE, BANK, and UGC-NET). Each page image is manually annotated with six hierarchical classes: Question_Paper_Area, Question_Block, Answer_Block, Question_Answer_Block, Instruction, and Description. Annotations are provided in both YOLO and COCO formats and were verified by expert annotators, achieving a gold-standard inter-annotator agreement (Cohen’s κ > 0.90).

By enabling reliable layout extraction in educational contexts, HiLEx opens up opportunities for scalable document understanding, intelligent grading, and inclusive learning technologies. This aligns with broader goals of SDG 4 (Quality Education) and SDG 10 (Reduced Inequalities).

Fig: Visualization of the Hierarchical Layout Annotation

HiLEx Dataset

The HiLEx dataset is a curated benchmark for hierarchical layout analysis in educational documents, specifically designed for question papers that exhibit diverse and complex visual structures. It addresses a major gap in existing DLA benchmarks by targeting the education domain—an area rich in practical use cases like exam digitization, automated grading, and content retrieval.

Key Statistics

1,965 Images
8 Exams Covered
English Language
6 Layout Classes
2 Annotation Formats
CC BY 4.0 License

Annotation Hierarchy

Each document image is annotated with a six-class hierarchical layout schema:

Question_Paper_Area
Question_Block
Answer_Block
Question_Answer_Block
Instruction
Description

This hierarchy captures the structured and nested nature of question papers across both single-column and multi-column layouts.
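
For reference, a minimal sketch of reading a page's YOLO-format labels is shown below. The class-ID ordering here is an assumption; take it from the class list shipped with the dataset.

# Sketch: reading one page's YOLO-format labels into (class, box) tuples.
# The class-ID ordering is an assumption; use the dataset's own class file.
CLASSES = [
    "Question_Paper_Area",
    "Question_Block",
    "Answer_Block",
    "Question_Answer_Block",
    "Instruction",
    "Description",
]

def read_yolo_labels(path):
    """Yield (class_name, x_center, y_center, width, height), all normalized to [0, 1]."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue                      # skip blank or malformed lines
            cls_id = int(parts[0])
            xc, yc, w, h = map(float, parts[1:])
            yield CLASSES[cls_id], xc, yc, w, h

for box in read_yolo_labels("labels/sample_page.txt"):
    print(box)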

Annotation Process

All annotations were performed manually by three domain experts, followed by multi-phase quality control. Final annotations achieved Cohen’s κ > 0.90, indicating very high inter-annotator agreement and reliability.
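
To make the agreement figure concrete, the sketch below shows how pairwise Cohen's κ can be computed with scikit-learn. It assumes the two annotators' boxes have already been matched (e.g. by IoU overlap), so only the class labels assigned to the same regions are compared; the labels shown are illustrative.

# Sketch: pairwise inter-annotator agreement on matched regions (scikit-learn).
# Assumes boxes from the two annotators were matched beforehand (e.g. by IoU),
# so only the class labels assigned to the same regions are compared.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Question_Block", "Answer_Block", "Instruction", "Question_Block"]
annotator_b = ["Question_Block", "Answer_Block", "Description", "Question_Block"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")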

Data Distribution & Access

HiLEx includes exam content from both national and international assessments, supporting cross-cultural analysis. The dataset is publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) license. You can access the data and tools from our GitHub repository:

GitHub: HiLEx-DLA/HiLEx

The repository provides the page images, annotations in both YOLO and COCO formats, and the accompanying tools.
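
As a quick start, the following sketch inspects the COCO-format annotations with pycocotools; the annotation file path is a hypothetical placeholder and should be replaced with the path used in the repository.

# Sketch: inspecting the COCO-format annotations with pycocotools.
# The annotation file path is a hypothetical placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/hilex_coco.json")
print("Classes:", [c["name"] for c in coco.loadCats(coco.getCatIds())])

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]      # COCO boxes are [x, y, width, height] in pixels
    print(coco.loadCats(ann["category_id"])[0]["name"], (x, y, w, h))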

Method

To evaluate the effectiveness of the HiLEx dataset, we benchmarked a diverse set of object detection models spanning multiple architectural paradigms. Each model was trained to detect the six hierarchical layout categories within question paper images, enabling a thorough comparison of their capabilities on this task.

Model Categories

We grouped the evaluated models into four families:

One-Stage Detectors

Fast detectors that predict bounding boxes and class labels directly in a single forward pass.

Examples: YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12
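
A minimal fine-tuning sketch for this family, using the Ultralytics API, is shown below. The dataset config "hilex.yaml" is a hypothetical placeholder listing the image/label paths and the six class names, and the hyperparameters are illustrative rather than the settings used in the paper.

# Sketch: fine-tuning a one-stage YOLO detector on HiLEx (Ultralytics API).
# "hilex.yaml" is a hypothetical dataset config; epochs/imgsz/batch are illustrative.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                      # start from COCO-pretrained weights
model.train(data="hilex.yaml", epochs=100, imgsz=640, batch=16)
metrics = model.val()                           # evaluate on the validation split
print(metrics.box.map50, metrics.box.map)       # mAP@50 and mAP@50-95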

Two-Stage Detector

A region-proposal-based detector that first generates candidate object regions and then classifies them.

Example: Detectron2 (Faster R-CNN)
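
The sketch below outlines fine-tuning a Faster R-CNN model on the COCO-format annotations with Detectron2. Dataset names, file paths, and the iteration count are placeholders, not the settings used in the paper.

# Sketch: fine-tuning Faster R-CNN on the HiLEx COCO annotations with Detectron2.
# Dataset names and paths are hypothetical placeholders.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

register_coco_instances("hilex_train", {}, "annotations/train.json", "images/train")
register_coco_instances("hilex_test", {}, "annotations/test.json", "images/test")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("hilex_train",)
cfg.DATASETS.TEST = ("hilex_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 6             # six HiLEx layout classes
cfg.SOLVER.MAX_ITER = 5000                      # illustrative schedule

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()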

Transformer-Based Models

Encoder-decoder models that use learned object queries and global attention for detection.

Examples: DETR, RT-DETR
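
A minimal inference sketch for a DETR-style model with HuggingFace Transformers is shown below, using the public COCO-pretrained checkpoint; a checkpoint fine-tuned on the six HiLEx classes would be loaded the same way.

# Sketch: DETR inference on a question-paper page (HuggingFace Transformers).
# Uses the public COCO-pretrained checkpoint, so the printed labels are COCO classes;
# a HiLEx fine-tuned checkpoint would be loaded identically.
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("sample_page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), [round(v, 1) for v in box.tolist()])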

Vision-Language Models (VLMs)

Pre-trained multi-modal models that understand images and text jointly, allowing zero-shot layout detection.

Examples: Florence-2, PaLI-Gemma2
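
The sketch below follows the published Florence-2 usage pattern for zero-shot object detection. How the generic detections are mapped onto the six HiLEx classes (e.g. through prompt design) depends on the evaluation protocol and is not reproduced here.

# Sketch: zero-shot detection on a page image with Florence-2 (HuggingFace Transformers).
# Follows the model card's usage pattern; mapping to HiLEx labels is not shown.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("sample_page.png").convert("RGB")
prompt = "<OD>"                                  # generic object-detection task token
inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(text, task="<OD>", image_size=image.size)
print(result)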

Implementation Details

All models were trained and evaluated under consistent settings, using an identical dataset split and the same evaluation criteria, to ensure a fair comparison.

Example Pipeline

During inference, each page image is processed by a trained model to detect hierarchical layout blocks (questions, answers, instructions, etc.). For YOLO-based detectors, we used the Ultralytics YOLO pipeline; for transformer models such as DETR, we leveraged the HuggingFace and MMDetection frameworks. Model outputs were evaluated with IoU-based metrics, reported as both per-class and overall averages.

A typical evaluation pipeline consisted of loading a trained model, running inference on each test page, matching predictions to ground-truth boxes at the chosen IoU thresholds, and computing per-class and overall mAP.
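
A minimal per-page inference and visualization sketch for a YOLO-family model is shown below; the weights path is a hypothetical placeholder.

# Sketch: per-page inference and visualization with a fine-tuned YOLO model.
# The weights path is a hypothetical placeholder.
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict("sample_page.png", conf=0.25)

for r in results:
    for box, cls, score in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        print(r.names[int(cls)], [round(v, 1) for v in box.tolist()], round(float(score), 2))
    cv2.imwrite("sample_page_pred.png", r.plot())  # r.plot() returns the page with drawn boxes (BGR)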

The results of this benchmarking are summarized below and on the Leaderboard, allowing comparison across model types and layout categories.

Results

We evaluated HiLEx across a diverse set of models representing four major detection paradigms: one-stage, two-stage, transformer, and vision-language. All models were trained and tested on the same dataset split and evaluation criteria to ensure comparability.

Key Highlights

All models were evaluated on the six layout classes with standard IoU thresholds (0.5 for mAP@50, and 0.5–0.95 for mAP@50–95). Detectron2, a two-stage model, attained the highest mAP@50, while YOLO variants offered competitive performance, with YOLOv11 achieving the best mAP@50–95. Vision-language models performed surprisingly well in zero-shot mode, though still behind the strongest fine-tuned detectors.

Class-wise Trends

Question_Paper_Area and Question_Answer_Block were consistently detected with high precision across models, likely due to their large size and distinctive structure. In contrast, Instruction and Description were the most challenging classes – many models struggled to detect these smaller, variably formatted elements, leading to lower recall for those classes. Notably, YOLOv11 excelled at localizing fine-grained Answer_Block regions, while PaLI-Gemma2 demonstrated strong generalization to unseen layouts despite not being fine-tuned.

Error Analysis

Common errors observed included over-detection in densely packed answer sections (multiple boxes around the same answer block) and missed instructions when their font style or placement differed from training examples. Vision-language models occasionally confused descriptions for question text blocks. These insights point to areas for future improvement, such as specialized handling of instruction text and better differentiation between similar text elements.

Visualizations

HiLEx features richly annotated pages with diverse layouts – from dense multi-column competitive exams to simpler single-question formats. Below we present key visual examples illustrating the dataset and model performance.

Ground Truth Layout Annotations

Each question paper image is annotated with color-coded boxes for the six structural classes, providing a hierarchical segmentation of the page content.

Fig: Example annotations on a question paper page, highlighting Question, Answer, Instruction, Description, etc.

Single-Column vs Multi-Column: HiLEx includes both single-column papers and multi-column layouts. The annotations adapt to each style, capturing blocks appropriately in each format.

Model Predictions

Below are sample outputs from fine-tuned detection models on HiLEx pages. Detected layout components are drawn with class-specific colors.

Fig: Sample predictions from different One-Stage Detector Models
Fig: Sample predictions from different Transformer-based Object Detector Models
Fig: Sample prediction from the Two-Stage Object Detector Model
Fig: Sample predictions from different Vision-Language Models

Performance Comparison

We visualize the detection performance across models to highlight accuracy vs. complexity trade-offs.

Fig: Performance comparison between the different models of HiLEx.
The first row shows One-Stage models and Transformer-based models; the second row shows Two-Stage models and Vision-Language Models.

Failure Cases

Certain layout elements remained difficult. Two common failure modes are shown below: (i) missed instructions due to unusual formatting or placement, and (ii) merged detections where question and answer text were not clearly separated.

Fig: Results showing missed instructions due to unusual formatting or placement
Fig: Results showing excessive merging of question and answer text

These visualizations help pinpoint where models perform well and where improvements are needed, guiding future research on hierarchical layout understanding.

Leaderboard

The table below presents the benchmarking results of various models on the HiLEx dataset. Models are grouped by architecture type. We report performance using mAP@50 and mAP@50–95 (mean average precision at IoU 0.5 and 0.5:0.95, respectively), which are standard object detection metrics.

All models were trained on the same training set and evaluated on an identical test set of HiLEx (covering all six layout classes). This ensures an apples-to-apples comparison of model capabilities.

Model        | Type            | Training   | Params | mAP@50 | mAP@50–95
Detectron2   | Two-Stage       | Fine-tuned | ~41M   | 94.0%  | 67.3%
YOLOv8       | One-Stage       | Fine-tuned | ~35M   | 84.8%  | 66.1%
YOLOv11      | One-Stage       | Fine-tuned | ~50M   | 82.2%  | 68.9%
RT-DETR      | Transformer     | Fine-tuned | ~50M   | 60.6%  | 44.8%
DETR         | Transformer     | Fine-tuned | ~86M   | 50.2%  | 35.0%
PaLI-Gemma2  | Vision-Language | Zero-shot  | >1B    | 83.5%  | 59.0%
Florence-2   | Vision-Language | Zero-shot  | >1B    | 81.0%  | 59.0%

Note: All metrics are averaged across the six layout classes. Vision-Language models (Florence-2, PaLI-Gemma2) were evaluated in zero-shot mode (no fine-tuning on HiLEx).

Insights

Two-stage detection (Detectron2) yields the highest mAP@50, while YOLOv11 achieves the best mAP@50–95. Transformer detectors (DETR, RT-DETR) lag noticeably on this dataset, whereas the zero-shot vision-language models reach mAP@50 above 80% without any HiLEx fine-tuning, underscoring their promise for layout tasks.

Submission Instructions

If you evaluate a new model on HiLEx and wish to have it listed on the leaderboard:

  1. Fork the GitHub repository and add your results to the leaderboard.json (or relevant file); a sketch of one possible entry format is shown after this list.
  2. Include the model name, architecture type, parameter count, whether it was fine-tuned or zero-shot, and its mAP@50 and mAP@50–95 on HiLEx.
  3. Include a citation (BibTeX or URL) for your model/paper if applicable.
  4. Open a Pull Request with your changes. Our team will review it and, upon verification, merge it to update the official leaderboard.
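
For illustration, the snippet below appends one hypothetical entry to leaderboard.json; the actual schema may differ, so mirror the existing entries in the repository file.

# Sketch: appending a hypothetical entry to leaderboard.json.
# Field names and placeholder values are illustrative, not the official schema.
import json

entry = {
    "model": "MyDetector",
    "type": "One-Stage",
    "training": "Fine-tuned",
    "params": "~30M",
    "mAP@50": 85.0,
    "mAP@50-95": 62.0,
    "citation": "https://example.org/my-detector-paper",
}

with open("leaderboard.json") as f:
    board = json.load(f)          # assumed to be a JSON list of entries
board.append(entry)
with open("leaderboard.json", "w") as f:
    json.dump(board, f, indent=2)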

We encourage the community to test novel models on HiLEx – let’s drive forward progress in document layout understanding!

Paper

HiLEx: Image-based Hierarchical Layout Extraction from Question Papers
Presented at the International Conference on Document Analysis and Recognition (ICDAR) 2025

Authors: Utathya Aich, Shinjini Chakraborty, Deepan Sadhukhan, Tulika Saha, and Swarnendu Ghosh

📄 Preprint on arXiv   |   📘 PDF Download   |   💻 GitHub Repo

Abstract

We introduce HiLEx, a new large-scale dataset and benchmark for hierarchical layout analysis in question paper images. HiLEx includes 1,965 exam pages annotated with six layout elements (questions, answers, instructions, etc.), covering content from eight diverse exams. We benchmark a range of models — including YOLOv8–12, Detectron2, RT-DETR, Florence-2, and PaLI-Gemma2 — and report state-of-the-art results using a two-stage fine-tuned detector. Our findings highlight the challenges of detecting instructions and descriptions, and the promise of vision-language models for future multi-modal layout tasks. This work aims to advance automated document understanding in education, supporting inclusive learning technologies aligned with SDG 4 and SDG 10.

For full details and results, please refer to the paper and code linked above.

Citation

If you use the HiLEx dataset, benchmarks, or tools in your research, please cite our paper:

@InProceedings{10.1007/978-3-032-04627-7_28,
  author    = {Aich, Utathya and Chakraborty, Shinjini and Sadhukhan, Deepan and Ghosh, Swarnendu and Saha, Tulika},
  editor    = {Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel},
  title     = {HiLEx: Image-Based Hierarchical Layout Extraction from Question Papers},
  booktitle = {Document Analysis and Recognition --  ICDAR 2025},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {485--505},
  abstract  = {Education is the cornerstone of societal progress, yet automated document layout understanding in the education domain remains significantly under-explored, with most research focusing on individual components like texts, tables, images instead of a holistic understanding. Despite the increasing demand for AI-driven assessment, digitization, and retrieval of educational resources, very few dedicated works exists for question paper layout analysis, a critical component of automated learning systems. To bridge this gap, we introduce the first dataset HiLEx, explicitly designed for structure analysis and layout extraction from question paper images. The HiLEx dataset is curated from eight diverse examination formats. With over 1900 annotations with different structural layouts, covering both single-column and multi-column layouts, ensure robust generalization across different structural variations. We conduct a thorough empirical study with most contemporary object detection models, exposing their limitations in structural understanding, and format generalization. Our findings lay the groundwork for Smart AI solutions in education, fostering automated grading, question retrieval, and equitable learning access. This research aligns with UN Sustainable Development Goals (SDG 4: Quality Education, SDG 10: Reduced Inequalities) by enabling scalable, AI-driven assessment technologies, promoting inclusivity, and revolutionizing educational accessibility worldwide. The HiLEx dataset is publicly available in Github (https://github.com/HiLEx-DLA/HiLEx).},
  isbn      = {978-3-032-04627-7}
}

Contact

The HiLEx project is a collaborative effort by researchers working at the intersection of document analysis, computer vision, and educational AI. We welcome feedback and collaboration inquiries.

Email: your.email@institution.edu

GitHub Issues: HiLEx-DLA/HiLEx

Whether you are building educational AI tools, benchmarking layout models, or exploring document analysis in new domains – we’d love to hear from you!