a weblog by Schuyler D. Erle

Wed, 09 Oct 2013

[15:51] The Weirdest Languages in the World

As you may know, last year I helped found a company called Idibon, where we're building massively scalable natural language processing. The following are some thoughts from our senior data scientist, Tyler Schnoebelen, on the complexities of understanding the complexity of language:

Natural language processing (NLP) is about finding patterns in language - for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it's very English-centric. English is far and away the language that linguists have worked on the most, and it's also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better a system can deal with diverse data, the more confident you can be in its ability to handle unseen data.

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things - 192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature.
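To make "how unusual it is for each feature" concrete, here is a minimal sketch of one way such a score could be computed. It is not the actual Language Weirdness Index calculation, which isn't spelled out here. The assumptions are mine: a value's unusualness is taken to be one minus the share of languages that have that value for the feature, and a language's score is the average over the features coded for it. The function name `weirdness_scores` and the toy table are hypothetical and are not real WALS data.

```python
from collections import Counter


def weirdness_scores(table):
    """Score each language by how rare its feature values are.

    table maps language -> {feature: value}. Features that are not
    coded for a language are simply absent from its dict.
    """
    # Count how often each value occurs for each feature.
    counts = {}
    for feats in table.values():
        for feature, value in feats.items():
            counts.setdefault(feature, Counter())[value] += 1

    scores = {}
    for lang, feats in table.items():
        if not feats:
            continue  # nothing coded, so nothing to score
        # Assumed measure: 1 - (share of languages with the same value).
        unusual = [
            1 - counts[f][v] / sum(counts[f].values())
            for f, v in feats.items()
        ]
        scores[lang] = sum(unusual) / len(unusual)
    return scores


# Made-up toy data for illustration only (not real WALS values).
toy = {
    "lang_a": {"word_order": "SOV", "tone": "none"},
    "lang_b": {"word_order": "SOV", "tone": "none"},
    "lang_c": {"word_order": "SVO", "tone": "none"},
    "lang_d": {"word_order": "VSO", "tone": "complex"},
}

scores = weirdness_scores(toy)
# Most unusual language first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy table, `lang_d` comes out on top because it has the minority value for both features. Averaging only over the features that are coded is a design choice that keeps sparsely documented languages from being penalized for missing data.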

The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that's where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.

But here's the rub - some of the weirdest languages in the world are ones you've heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.

... in spite of being spoken today by over a billion people!

Read more...

© copyright 2002-2013 Schuyler Erle * [email protected]
All original material on this website is licensed under the Creative Commons.
= still powered by Blosxom (after all these years) =