Somali web corpus, crawled with SpiderLing in January 2016. The text is UTF-8 encoded, cleaned, and deduplicated.

| Counts | |
|---|---|
| Tokens | 79,741,231 |
| Words | 71,871,585 |
| Sentences | 2,643,336 |
| Paragraphs | 1,937,758 |
| Documents | 385,338 |

| General info | |
|---|---|
| Corpus description | Document |
| Language | Somali |
| Encoding | UTF-8 |
| Compiled | 12/16/2022 18:04:21 |
| Word sketch grammar | Definition |

| Lexicon sizes | |
|---|---|
| word | 1,399,350 |
| tag | 13 |
| lc | 1,159,063 |

| Structure sizes | |
|---|---|
| doc | 385,338 |
| p | 1,937,758 |
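
The figures in the Counts table are enough to work out a few rough averages for the corpus. The following is a minimal sketch in Python, not part of the corpus tooling. The dictionary keys and the `ratio` helper are illustrative names, and the code assumes that the gap between tokens and words consists of non-word tokens such as punctuation.

```python
# Counts copied from the Counts table of this corpus description.
counts = {
    "tokens": 79_741_231,
    "words": 71_871_585,
    "sentences": 2_643_336,
    "paragraphs": 1_937_758,
    "documents": 385_338,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return counts[numerator] / counts[denominator] (hypothetical helper)."""
    return counts[numerator] / counts[denominator]


# Assumption: tokens that are not words are punctuation and similar items.
non_word_tokens = counts["tokens"] - counts["words"]

derived = {
    "tokens_per_sentence": ratio("tokens", "sentences"),
    "sentences_per_paragraph": ratio("sentences", "paragraphs"),
    "paragraphs_per_document": ratio("paragraphs", "documents"),
    "words_per_document": ratio("words", "documents"),
    "non_word_token_share": non_word_tokens / counts["tokens"],
}

for name, value in derived.items():
    print(f"{name}: {value:.3f}")
```

These are corpus-wide means only. They say nothing about how document or sentence lengths are distributed, which would require the corpus itself.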