Hebrew is one of those languages that remain underserved in NLP. While English datasets are abundant, researchers and practitioners working with Hebrew often face a frustrating gap: high-quality, large-scale sentence-level corpora are hard to come by. I decided to do something about it.
The Dataset
The Hebrew Wikipedia Sentences Corpus is a collection of 10,999,257 cleaned, deduplicated Hebrew sentences, extracted from 366,610 Hebrew Wikipedia articles. The entire dataset is 1.8 GB in Parquet format and released under CC BY-SA 3.0.
Each sentence comes with rich metadata:
- Article context — the source article ID, title, and Wikipedia categories
- Position tracking — where the sentence appeared within its article
- Quality signals — word count and Hebrew character ratio
The sentences average 16.6 words, with a median Hebrew ratio of 1.0 (fully Hebrew text). Every sentence is between 5 and 50 words, ensuring a useful range for most NLP tasks.
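The Hebrew ratio quality signal can be approximated in a few lines. This is a sketch of how such a metric might be computed, not the corpus's actual implementation; it assumes "Hebrew ratio" means the fraction of alphabetic characters that fall in the Hebrew letter block (U+05D0–U+05EA).

```python
def hebrew_ratio(sentence: str) -> float:
    """Fraction of alphabetic characters that are Hebrew letters.

    A sketch of the quality signal described above; the dataset's
    exact definition may differ (e.g. how digits or punctuation count).
    """
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return 0.0
    # U+05D0 (alef) through U+05EA (tav), which includes the final forms
    hebrew = sum(1 for c in letters if "\u05d0" <= c <= "\u05ea")
    return hebrew / len(letters)
```

A fully Hebrew sentence scores 1.0, matching the corpus's reported median.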
How It Was Built
The pipeline has three stages:
- Crawl — All Hebrew Wikipedia articles were fetched via the MediaWiki API
- Extract — Wikitext was converted to plain text and split into sentences, with filtering for length (5–50 words), Hebrew character ratio (≥50%), and content quality
- Deduplicate — Exact duplicates were removed using SHA-256 hashing
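The extract and deduplicate stages can be sketched as follows. This is a minimal illustration of the filtering rules stated above (5–50 words, ≥50% Hebrew characters, SHA-256 exact dedup), not the pipeline's actual code:

```python
import hashlib

def passes_filters(sentence: str, min_words: int = 5,
                   max_words: int = 50, min_hebrew: float = 0.5) -> bool:
    """Apply the length and Hebrew-ratio filters described above (a sketch)."""
    # Length filter: keep sentences of 5-50 words
    n_words = len(sentence.split())
    if not (min_words <= n_words <= max_words):
        return False
    # Hebrew character ratio filter: >= 50% of letters must be Hebrew
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return False
    hebrew = sum(1 for c in letters if "\u05d0" <= c <= "\u05ea")
    return hebrew / len(letters) >= min_hebrew

def deduplicate(sentences):
    """Drop exact duplicates via SHA-256 hashing, keeping first occurrences."""
    seen, unique = set(), []
    for s in sentences:
        digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique
```

Hashing sentence text rather than storing full strings keeps the dedup set memory-bounded even at the ~11M-sentence scale.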
This approach prioritizes clean, usable data over raw volume. Wikipedia's encyclopedic register means the text is well-structured and grammatically sound, though it skews formal.
Use Cases
This corpus is designed to support a range of Hebrew NLP tasks:
- Language model pretraining and fine-tuning — 11M sentences provide substantial training signal
- Text classification — Article categories enable supervised and semi-supervised approaches
- Sentence similarity and semantic search — Clean, deduplicated sentences make strong training pairs
- Named Entity Recognition — Wikipedia text is rich with named entities
- Benchmarking — A standardized corpus for comparing Hebrew NLP models
Getting Started
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (~1.8 GB in Parquet)
ds = load_dataset("tomron87/hebrew-wikipedia-sentences-corpus")

print(f"Total sentences: {len(ds['train']):,}")
print(ds["train"][0])  # first sentence with its metadata
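Once loaded, the metadata columns can drive simple filters. The column names "word_count" and "hebrew_ratio" below are my assumption based on the metadata description; check ds["train"].features for the actual schema.

```python
def is_long_fully_hebrew(example: dict, min_words: int = 20) -> bool:
    """Select longer, fully-Hebrew sentences.

    Hypothetical: the keys "word_count" and "hebrew_ratio" are assumed
    from the metadata description, not verified against the schema.
    """
    return example["word_count"] >= min_words and example["hebrew_ratio"] == 1.0

# Hypothetical usage with the loaded dataset:
# subset = ds["train"].filter(is_long_fully_hebrew)
```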
Limitations Worth Noting
The corpus reflects Wikipedia's characteristics: formal register, encyclopedic tone, and uneven topic coverage driven by editor demographics. It does not include spoken Hebrew, social media language, or informal writing. It is a snapshot from February 2026 and will not update automatically.
For applications requiring colloquial Hebrew or domain-specific language (medical, legal, etc.), this corpus works best as a foundation, supplemented with targeted in-domain data.
Access the Dataset
The full dataset is available on Hugging Face: tomron87/hebrew-wikipedia-sentences-corpus
If you use this dataset in your research, please cite it:
@dataset{hebrew_wikipedia_sentences,
  title   = {Hebrew Wikipedia Sentences},
  author  = {Tom Ron},
  year    = {2026},
  url     = {https://huggingface.co/datasets/tomron87/hebrew-wikipedia-sentences-corpus},
  license = {CC BY-SA 3.0}
}
I'd love to hear what you build with it.