As you may know, last year I helped found a company called Idibon, where we're building massively scalable
natural language processing. The following are some thoughts from our senior
data scientist, Tyler Schnoebelen, on the challenges of understanding the
complexity of language:
Natural language processing (NLP) is about finding patterns in language - for
example, taking heaps of unstructured text and automatically pulling out its
structure. The open secret about NLP is that it's very English-centric. English
is far and away the language that linguists have worked on the most and it's
also the language that has the most available resources for computer science
projects (and more data is almost always better in computer science). So one of
the best ways to test an NLP system is to try languages other than English. The
better a system can deal with diverse data, the more confident you
can be in its ability to handle unseen data.
The World Atlas of Language Structures
evaluates 2,676 different languages in terms of a bunch of different language
features. These features include word order, types of sounds, ways of doing
negation, and a lot of other things - 192 different language features in total.
So rather than take an English-centric view of the world, WALS allows us to take a
worldwide view. That is, we evaluate each language in terms of how unusual it
is for each feature.
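To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It is an illustration under stated assumptions, not the actual method behind the index: it assumes that a language's "unusualness" on a feature is one minus the share of languages that have the same value, and that the overall score is the plain average over the features recorded for that language. The function name, the language names, and the feature values are all made up for the example.

```python
from collections import Counter


def weirdness_scores(languages):
    """Score each language by how rare its feature values are.

    `languages` maps a language name to a dict of {feature: value}.
    Features with no recorded value are simply left out of the dict.
    Assumption: rarity = 1 - (share of languages with the same value),
    averaged over the features recorded for that language.
    """
    # Count how often each value occurs for each feature.
    counts = {}
    for features in languages.values():
        for feature, value in features.items():
            counts.setdefault(feature, Counter())[value] += 1

    scores = {}
    for name, features in languages.items():
        if not features:
            continue  # nothing recorded, so nothing to score
        rarities = []
        for feature, value in features.items():
            total = sum(counts[feature].values())
            share = counts[feature][value] / total
            rarities.append(1.0 - share)
        scores[name] = sum(rarities) / len(rarities)
    return scores


# Hypothetical toy data: four invented languages, two invented features.
toy = {
    "lang_a": {"word_order": "SVO", "tone": "no"},
    "lang_b": {"word_order": "SVO", "tone": "no"},
    "lang_c": {"word_order": "SOV", "tone": "no"},
    "lang_d": {"word_order": "VSO", "tone": "yes"},
}

scores = weirdness_scores(toy)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # the language with the rarest feature values overall
```

In this toy data, `lang_d` comes out on top because it is the only language with its word order and the only one with tone. A real analysis would also have to decide how to handle languages with very few recorded features, which this sketch ignores.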
The language that is most different from the majority of all other languages
in the world is a verb-initial tonal language spoken by 6,000 people in
Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el
Grande Mixtec). Number two is spoken in Siberia by 22,000 people:
Nenets (that's where we get the word parka from).
Number three is Choctaw, spoken by about 10,000 people, mostly
But here's the rub - some of the weirdest languages in the world are ones you've
heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually
English is #33 in the Language Weirdness Index.
... in spite of being spoken today by over a billion people!