HiLEx: Image-based Hierarchical Layout Extraction
from Question Papers


Utathya Aich, Shinjini Chakraborty, Deepan Sadhukhan, Tulika Saha, and Swarnendu Ghosh

About HiLEx

Document Layout Analysis (DLA) has advanced significantly in structured domains like invoices, forms, and academic papers. However, the layout understanding of educational content – especially question papers – remains largely underexplored. These documents are inherently multi-modal and hierarchical, containing a mix of textual structures like instructions, questions, answer blocks, and descriptions, often organized in complex multi-column formats.

HiLEx (Hierarchical Layout Extraction) is the first large-scale benchmark dataset designed specifically for understanding the layout of question paper images. It comprises 1,965 exam pages collected from eight major exams (GMAT, GRE, SAT, JEE, UPSC, GATE, BANK, and UGC-NET). Each page image is manually annotated with six hierarchical classes: Question_Paper_Area, Question_Block, Answer_Block, Question_Answer_Block, Instruction, and Description. Annotations are provided in both YOLO and COCO formats and were verified by expert annotators, achieving a gold-standard inter-annotator agreement (Cohen’s κ > 0.90).

By enabling reliable layout extraction in educational contexts, HiLEx opens up opportunities for scalable document understanding, intelligent grading, and inclusive learning technologies. This aligns with broader goals of SDG 4 (Quality Education) and SDG 10 (Reduced Inequalities).

Fig: Visualization of the Hierarchical Layout Annotation

HiLEx Dataset

The HiLEx dataset is a curated benchmark for hierarchical layout analysis in educational documents, specifically designed for question papers that exhibit diverse and complex visual structures. It addresses a major gap in existing DLA benchmarks by targeting the education domain—an area rich in practical use cases like exam digitization, automated grading, and content retrieval.

Key Statistics

1,965 Images
8 Exams Covered
English Language
6 Layout Classes
2 Annotation Formats
CC BY 4.0 License

Annotation Hierarchy

Each document image is annotated with a six-class hierarchical layout schema:

Question_Paper_Area
Question_Block
Answer_Block
Question_Answer_Block
Instruction
Description

This hierarchy captures the structured and nested nature of question papers across both single-column and multi-column layouts.
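
For reference, a minimal sketch of reading a page's YOLO-format labels is shown below. The class-ID ordering here is an assumption; take it from the class list shipped with the dataset.

# Sketch: reading one page's YOLO-format labels into (class, box) tuples.
# The class-ID ordering is an assumption; use the dataset's own class file.
CLASSES = [
    "Question_Paper_Area",
    "Question_Block",
    "Answer_Block",
    "Question_Answer_Block",
    "Instruction",
    "Description",
]

def read_yolo_labels(path):
    """Yield (class_name, x_center, y_center, width, height), all normalized to [0, 1]."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue                      # skip blank or malformed lines
            cls_id = int(parts[0])
            xc, yc, w, h = map(float, parts[1:])
            yield CLASSES[cls_id], xc, yc, w, h

for box in read_yolo_labels("labels/sample_page.txt"):
    print(box)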

Annotation Process

All annotations were performed manually by three domain experts, followed by multi-phase quality control. Final annotations achieved Cohen’s κ > 0.90, indicating very high inter-annotator agreement and reliability.
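
To make the agreement figure concrete, the sketch below shows how pairwise Cohen's κ can be computed with scikit-learn. It assumes the two annotators' boxes have already been matched (e.g. by IoU overlap), so only the class labels assigned to the same regions are compared; the labels shown are illustrative.

# Sketch: pairwise inter-annotator agreement on matched regions (scikit-learn).
# Assumes boxes from the two annotators were matched beforehand (e.g. by IoU),
# so only the class labels assigned to the same regions are compared.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Question_Block", "Answer_Block", "Instruction", "Question_Block"]
annotator_b = ["Question_Block", "Answer_Block", "Description", "Question_Block"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")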

Data Distribution & Access

HiLEx includes exam content from both national and international assessments, supporting cross-cultural analysis. The dataset is publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) license. You can access the data and tools from our GitHub repository:

GitHub: HiLEx-DLA/HiLEx

The repository provides the page images, annotations in both YOLO and COCO formats, and the accompanying tools.
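
As a quick start, the following sketch inspects the COCO-format annotations with pycocotools; the annotation file path is a hypothetical placeholder and should be replaced with the path used in the repository.

# Sketch: inspecting the COCO-format annotations with pycocotools.
# The annotation file path is a hypothetical placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/hilex_coco.json")
print("Classes:", [c["name"] for c in coco.loadCats(coco.getCatIds())])

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]      # COCO boxes are [x, y, width, height] in pixels
    print(coco.loadCats(ann["category_id"])[0]["name"], (x, y, w, h))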

Method

To evaluate the effectiveness of the HiLEx dataset, we benchmarked a diverse set of object detection models spanning multiple architectural paradigms. Each model was trained to detect the six hierarchical layout categories within question paper images, enabling a thorough comparison of their capabilities on this task.

Model Categories

We grouped the evaluated models into four families:

One-Stage Detectors

Fast detectors that predict bounding boxes and class labels directly in a single forward pass.

Examples: YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12
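
A minimal fine-tuning sketch for this family, using the Ultralytics API, is shown below. The dataset config "hilex.yaml" is a hypothetical placeholder listing the image/label paths and the six class names, and the hyperparameters are illustrative rather than the settings used in the paper.

# Sketch: fine-tuning a one-stage YOLO detector on HiLEx (Ultralytics API).
# "hilex.yaml" is a hypothetical dataset config; epochs/imgsz/batch are illustrative.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                      # start from COCO-pretrained weights
model.train(data="hilex.yaml", epochs=100, imgsz=640, batch=16)
metrics = model.val()                           # evaluate on the validation split
print(metrics.box.map50, metrics.box.map)       # mAP@50 and mAP@50-95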

Two-Stage Detector

A region-proposal-based detector that first generates candidate object regions and then classifies them.

Example: Detectron2 (Faster R-CNN)
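
The sketch below outlines fine-tuning a Faster R-CNN model on the COCO-format annotations with Detectron2. Dataset names, file paths, and the iteration count are placeholders, not the settings used in the paper.

# Sketch: fine-tuning Faster R-CNN on the HiLEx COCO annotations with Detectron2.
# Dataset names and paths are hypothetical placeholders.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

register_coco_instances("hilex_train", {}, "annotations/train.json", "images/train")
register_coco_instances("hilex_test", {}, "annotations/test.json", "images/test")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("hilex_train",)
cfg.DATASETS.TEST = ("hilex_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 6             # six HiLEx layout classes
cfg.SOLVER.MAX_ITER = 5000                      # illustrative schedule

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()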

Transformer-Based Models

Encoder-decoder models that use learned object queries and global attention for detection.

Examples: DETR, RT-DETR
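
A minimal inference sketch for a DETR-style model with HuggingFace Transformers is shown below, using the public COCO-pretrained checkpoint; a checkpoint fine-tuned on the six HiLEx classes would be loaded the same way.

# Sketch: DETR inference on a question-paper page (HuggingFace Transformers).
# Uses the public COCO-pretrained checkpoint, so the printed labels are COCO classes;
# a HiLEx fine-tuned checkpoint would be loaded identically.
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("sample_page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), [round(v, 1) for v in box.tolist()])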

Vision-Language Models (VLMs)

Pre-trained multi-modal models that understand images and text jointly, allowing zero-shot layout detection.

Examples: Florence-2, PaLI-Gemma2
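
The sketch below follows the published Florence-2 usage pattern for zero-shot object detection. How the generic detections are mapped onto the six HiLEx classes (e.g. through prompt design) depends on the evaluation protocol and is not reproduced here.

# Sketch: zero-shot detection on a page image with Florence-2 (HuggingFace Transformers).
# Follows the model card's usage pattern; mapping to HiLEx labels is not shown.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("sample_page.png").convert("RGB")
prompt = "<OD>"                                  # generic object-detection task token
inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(text, task="<OD>", image_size=image.size)
print(result)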

Implementation Details

All models were trained and evaluated under consistent settings, using an identical dataset split and the same evaluation criteria, to ensure a fair comparison.

Example Pipeline

During inference, each page image is processed by a trained model to detect hierarchical layout blocks (questions, answers, instructions, etc.). For YOLO-based detectors, we used the Ultralytics YOLO pipeline; for transformer models such as DETR, we leveraged the HuggingFace and MMDetection frameworks. Model outputs were evaluated with IoU-based metrics, reported as both per-class and overall averages.

A typical evaluation pipeline consisted of loading a trained model, running inference on each test page, matching predictions to ground-truth boxes at the chosen IoU thresholds, and computing per-class and overall mAP.
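
A minimal per-page inference and visualization sketch for a YOLO-family model is shown below; the weights path is a hypothetical placeholder.

# Sketch: per-page inference and visualization with a fine-tuned YOLO model.
# The weights path is a hypothetical placeholder.
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict("sample_page.png", conf=0.25)

for r in results:
    for box, cls, score in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        print(r.names[int(cls)], [round(v, 1) for v in box.tolist()], round(float(score), 2))
    cv2.imwrite("sample_page_pred.png", r.plot())  # r.plot() returns the page with drawn boxes (BGR)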

The results of this benchmarking are summarized below and on the Leaderboard, allowing comparison across model types and layout categories.

Results

We evaluated HiLEx across a diverse set of models representing four major detection paradigms: one-stage, two-stage, transformer, and vision-language. All models were trained and tested on the same dataset split and evaluation criteria to ensure comparability.

Key Highlights

All models were evaluated on the six layout classes with standard IoU thresholds (0.5 for mAP@50, and 0.5–0.95 for mAP@50–95). Detectron2, a two-stage model, attained the highest mAP@50, while YOLO variants offered competitive performance, with YOLOv11 achieving the best mAP@50–95. Vision-language models performed surprisingly well in zero-shot mode, though still behind the strongest fine-tuned detectors.

Class-wise Trends

Question_Paper_Area and Question_Answer_Block were consistently detected with high precision across models, likely due to their large size and distinctive structure. In contrast, Instruction and Description were the most challenging classes – many models struggled to detect these smaller, variably formatted elements, leading to lower recall for those classes. Notably, YOLOv11 excelled at localizing fine-grained Answer_Block regions, while PaLI-Gemma2 demonstrated strong generalization to unseen layouts despite not being fine-tuned.

Error Analysis

Common errors observed included over-detection in densely packed answer sections (multiple boxes around the same answer block) and missed instructions when their font style or placement differed from training examples. Vision-language models occasionally confused descriptions for question text blocks. These insights point to areas for future improvement, such as specialized handling of instruction text and better differentiation between similar text elements.

Visualizations

HiLEx features richly annotated pages with diverse layouts – from dense multi-column competitive exams to simpler single-question formats. Below we present key visual examples illustrating the dataset and model performance.

Ground Truth Layout Annotations

Each question paper image is annotated with color-coded boxes for the six structural classes, providing a hierarchical segmentation of the page content.

Fig: Example annotations on a question paper page, highlighting Question, Answer, Instruction, Description, etc.

Single-Column vs Multi-Column: HiLEx includes both single-column papers and multi-column layouts. The annotations adapt to each style, capturing blocks appropriately in each format.

Model Predictions

Below are sample outputs from fine-tuned detection models on HiLEx pages. Detected layout components are drawn with class-specific colors.

Fig: Sample predictions from different One-Stage Detector Models
Fig: Sample predictions from different Transformer-based Object Detector Models
Fig: Sample prediction from the Two-Stage Object Detector Model
Fig: Sample predictions from different Vision-Language Models

Performance Comparison

We visualize the detection performance across models to highlight accuracy vs. complexity trade-offs.

Fig: Performance comparison between the different models of HiLEx.
The first row shows One-Stage models and Transformer-based models; the second row shows Two-Stage models and Vision-Language Models.

Failure Cases

Certain layout elements remained difficult. Two common failure modes are shown below: (i) missed instructions due to unusual formatting or placement, and (ii) merged detections where question and answer text were not clearly separated.

Fig: Results showing missed instructions due to unusual formatting or placement
Fig: Results showing excessive merging of question and answer text

These visualizations help pinpoint where models perform well and where improvements are needed, guiding future research on hierarchical layout understanding.

Leaderboard

The table below presents the benchmarking results of various models on the HiLEx dataset. Models are grouped by architecture type. We report performance using mAP@50 and mAP@50–95 (mean average precision at IoU 0.5 and 0.5:0.95, respectively), which are standard object detection metrics.

All models were trained on the same training set and evaluated on an identical test set of HiLEx (covering all six layout classes). This ensures an apples-to-apples comparison of model capabilities.

Model        | Type            | Training   | Params | mAP@50 | mAP@50–95
Detectron2   | Two-Stage       | Fine-tuned | ~41M   | 94.0%  | 67.3%
YOLOv8       | One-Stage       | Fine-tuned | ~35M   | 84.8%  | 66.1%
YOLOv11      | One-Stage       | Fine-tuned | ~50M   | 82.2%  | 68.9%
RT-DETR      | Transformer     | Fine-tuned | ~50M   | 60.6%  | 44.8%
DETR         | Transformer     | Fine-tuned | ~86M   | 50.2%  | 35.0%
PaLI-Gemma2  | Vision-Language | Zero-shot  | >1B    | 83.5%  | 59.0%
Florence-2   | Vision-Language | Zero-shot  | >1B    | 81.0%  | 59.0%

Note: All metrics are averaged across the six layout classes. Vision-Language models (Florence-2, PaLI-Gemma2) were evaluated in zero-shot mode (no fine-tuning on HiLEx).

Insights

Two-stage detection (Detectron2) yields the highest mAP@50, while YOLOv11 achieves the best mAP@50–95. Transformer detectors (DETR, RT-DETR) lag noticeably on this dataset, whereas the zero-shot vision-language models reach mAP@50 above 80% without any HiLEx fine-tuning, underscoring their promise for layout tasks.

Submission Instructions

If you evaluate a new model on HiLEx and wish to have it listed on the leaderboard:

  1. Fork the GitHub repository and add your results to the leaderboard.json (or relevant file); a sketch of one possible entry format is shown after this list.
  2. Include the model name, architecture type, parameter count, whether it was fine-tuned or zero-shot, and its mAP@50 and mAP@50–95 on HiLEx.
  3. Include a citation (BibTeX or URL) for your model/paper if applicable.
  4. Open a Pull Request with your changes. Our team will review it and, upon verification, merge it to update the official leaderboard.
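
For illustration, the snippet below appends one hypothetical entry to leaderboard.json; the actual schema may differ, so mirror the existing entries in the repository file.

# Sketch: appending a hypothetical entry to leaderboard.json.
# Field names and placeholder values are illustrative, not the official schema.
import json

entry = {
    "model": "MyDetector",
    "type": "One-Stage",
    "training": "Fine-tuned",
    "params": "~30M",
    "mAP@50": 85.0,
    "mAP@50-95": 62.0,
    "citation": "https://example.org/my-detector-paper",
}

with open("leaderboard.json") as f:
    board = json.load(f)          # assumed to be a JSON list of entries
board.append(entry)
with open("leaderboard.json", "w") as f:
    json.dump(board, f, indent=2)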

We encourage the community to test novel models on HiLEx – let’s drive forward progress in document layout understanding!

Paper

HiLEx: Image-based Hierarchical Layout Extraction from Question Papers
Presented at the International Conference on Document Analysis and Recognition (ICDAR) 2025

Authors: Utathya Aich, Shinjini Chakraborty, Deepan Sadhukhan, Tulika Saha, and Swarnendu Ghosh

📄 Preprint on arXiv   |   📘 PDF Download   |   💻 GitHub Repo

Abstract

We introduce HiLEx, a new large-scale dataset and benchmark for hierarchical layout analysis in question paper images. HiLEx includes 1,965 exam pages annotated with six layout elements (questions, answers, instructions, etc.), covering content from eight diverse exams. We benchmark a range of models — including YOLOv8–12, Detectron2, RT-DETR, Florence-2, and PaLI-Gemma2 — and report state-of-the-art results using a two-stage fine-tuned detector. Our findings highlight the challenges of detecting instructions and descriptions, and the promise of vision-language models for future multi-modal layout tasks. This work aims to advance automated document understanding in education, supporting inclusive learning technologies aligned with SDG 4 and SDG 10.

For full details and results, please refer to the paper and code linked above.

Citation

If you use the HiLEx dataset, benchmarks, or tools in your research, please cite our paper:

@InProceedings{10.1007/978-3-032-04627-7_28,
  author    = {Aich, Utathya and Chakraborty, Shinjini and Sadhukhan, Deepan and Ghosh, Swarnendu and Saha, Tulika},
  editor    = {Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel},
  title     = {HiLEx: Image-Based Hierarchical Layout Extraction from Question Papers},
  booktitle = {Document Analysis and Recognition --  ICDAR 2025},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {485--505},
  abstract  = {Education is the cornerstone of societal progress, yet automated document layout understanding in the education domain remains significantly under-explored, with most research focusing on individual components like texts, tables, images instead of a holistic understanding. Despite the increasing demand for AI-driven assessment, digitization, and retrieval of educational resources, very few dedicated works exists for question paper layout analysis, a critical component of automated learning systems. To bridge this gap, we introduce the first dataset HiLEx, explicitly designed for structure analysis and layout extraction from question paper images. The HiLEx dataset is curated from eight diverse examination formats. With over 1900 annotations with different structural layouts, covering both single-column and multi-column layouts, ensure robust generalization across different structural variations. We conduct a thorough empirical study with most contemporary object detection models, exposing their limitations in structural understanding, and format generalization. Our findings lay the groundwork for Smart AI solutions in education, fostering automated grading, question retrieval, and equitable learning access. This research aligns with UN Sustainable Development Goals (SDG 4: Quality Education, SDG 10: Reduced Inequalities) by enabling scalable, AI-driven assessment technologies, promoting inclusivity, and revolutionizing educational accessibility worldwide. The HiLEx dataset is publicly available in Github (https://github.com/HiLEx-DLA/HiLEx).},
  isbn      = {978-3-032-04627-7}
}

Contact

The HiLEx project is a collaborative effort by researchers working at the intersection of document analysis, computer vision, and educational AI. We welcome feedback and collaboration inquiries.

Email: your.email@institution.edu

GitHub Issues: HiLEx-DLA/HiLEx

Whether you are building educational AI tools, benchmarking layout models, or exploring document analysis in new domains – we’d love to hear from you!