HASYv2

Task: Image Classification
Classes: 369
Image Size: 32×32

Overview

The HASYv2 dataset contains 168,236 handwritten symbol images spanning 369 classes. It includes Latin characters, numerals, and a wide variety of mathematical and scientific symbols (e.g., Greek letters, operators). The dataset is designed to benchmark classification algorithms on symbols with high intra-class variability and visual similarity.

The dataset is split into 10 predefined folds for cross-validation. Select a fold with the configuration name (fold-1 through fold-10).

Config Name   Description                          Split Ratio
fold-1        Standard benchmark split (Default)   ~90% Train / 10% Test
fold-2        Cross-validation split #2            ~90% Train / 10% Test
fold-10       Cross-validation split #10           ~90% Train / 10% Test
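As a rough check on the ratio listed above, the arithmetic can be sketched in a few lines. The total of 168,236 images and the 369 classes come from the overview. The exact per-fold counts are not stated here, so the numbers below are estimates, and `approximate_split_sizes` is a hypothetical helper, not part of the library.

```python
# Figures quoted in the overview of this page.
TOTAL_IMAGES = 168_236
NUM_CLASSES = 369


def approximate_split_sizes(total=TOTAL_IMAGES, test_fraction=0.10):
    """Estimate (train, test) sizes for a ~90/10 split.

    This is only an estimate: the real folds may differ by a few images.
    """
    test = round(total * test_fraction)
    return total - test, test


train_size, test_size = approximate_split_sizes()
print(f"approx. train images per fold: {train_size}")
print(f"approx. test images per fold:  {test_size}")

# Average number of images per class over the whole dataset.
# The classes are not necessarily balanced, so this is a mean only.
print(f"mean images per class: {TOTAL_IMAGES / NUM_CLASSES:.1f}")
```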

All images are 32×32 pixels and black-and-white; they are exposed as grayscale PIL images (see Data Structure).

[Figure: HASYv2 teaser image (hasy_v2_teaser.png)]

Data Structure

Accessing an example with ds[i] returns a dictionary with the following keys:

Key     Type              Description
image   PIL.Image.Image   32×32 grayscale handwritten symbol image
label   int               Class label (0-368). Maps to the symbol ID (e.g., “31” for ‘1’).
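The layout described above can be checked with a small validation helper. The following is a minimal sketch: `check_sample` is a hypothetical function, not part of the library, and the demonstration uses a stand-in object with a `size` attribute (as PIL images have) so that it runs without downloading the dataset.

```python
from types import SimpleNamespace

NUM_CLASSES = 369          # labels run from 0 to 368
IMAGE_SIZE = (32, 32)      # (width, height), as reported by PIL's Image.size


def check_sample(sample):
    """Raise ValueError if `sample` does not match the documented layout."""
    missing = {"image", "label"} - set(sample.keys())
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    label = sample["label"]
    if not isinstance(label, int) or not 0 <= label < NUM_CLASSES:
        raise ValueError(f"label out of range: {label!r}")

    size = tuple(sample["image"].size)
    if size != IMAGE_SIZE:
        raise ValueError(f"unexpected image size: {size}")


# Stand-in for a real example; with the dataset loaded this would be ds[i].
fake_sample = {"image": SimpleNamespace(size=(32, 32)), "label": 0}
check_sample(fake_sample)  # passes silently
```

A check like this is mostly useful as a sanity test in a data pipeline, before images are converted to arrays or tensors.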

Usage Example

Basic Usage

You can specify a config_name to choose which cross-validation fold to use. If not specified, it defaults to "fold-1".

from stable_datasets.images.hasy_v2 import HASYv2

# Load the standard benchmark split (Fold 1)
ds_train = HASYv2(config_name="fold-1", split="train")
ds_test = HASYv2(config_name="fold-1", split="test")

# Load a specific fold for cross-validation
ds_train_f5 = HASYv2(config_name="fold-5", split="train")

sample = ds_train[0]
print(sorted(sample.keys()))  # ['image', 'label']

# Get the actual symbol ID string (e.g., "31")
label_id = ds_train.features["label"].int2str(sample["label"])
print(f"Symbol ID: {label_id}")

References

Citation

@article{thoma2017hasyv2,
  title={The {HASYv2} dataset},
  author={Thoma, Martin},
  journal={arXiv preprint arXiv:1701.08380},
  year={2017}
}