Somali WaC [2016]

Somali web corpus, collected using SpiderLing in January 2016. The text is UTF-8 encoded, cleaned, and deduplicated.

Counts
Tokens 79,741,231
Words 71,871,585
Sentences 2,643,336
Paragraphs 1,937,758
Documents 385,338

General info
Corpus description Document
Language Somali
Encoding UTF-8
Compiled 12/16/2022 18:04:21
Word sketch grammar Definition

Lexicon sizes
word 1,399,350
tag 13
lc 1,159,063

Structures and attributes:

doc 385,338
p 1,937,758
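
The counts listed above can be combined into a few average sizes, such as tokens per sentence or paragraphs per document. The following is a minimal sketch of that arithmetic; the figures are copied from the Counts section, while the dictionary keys and the `ratio` helper are illustrative names and not part of any corpus tool.

```python
# Figures copied from the Counts section of this page.
counts = {
    "tokens": 79_741_231,
    "words": 71_871_585,
    "sentences": 2_643_336,
    "paragraphs": 1_937_758,
    "documents": 385_338,
}


def ratio(numerator: str, denominator: str) -> float:
    """Divide one count by another (hypothetical helper for this sketch)."""
    return counts[numerator] / counts[denominator]


# Derived averages; the labels are descriptive only.
ratios = {
    "tokens per sentence": ratio("tokens", "sentences"),
    "sentences per paragraph": ratio("sentences", "paragraphs"),
    "paragraphs per document": ratio("paragraphs", "documents"),
    "tokens per document": ratio("tokens", "documents"),
    "words per token": ratio("words", "tokens"),
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

These are plain averages over the whole corpus, so they say nothing about how document or sentence lengths are distributed.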