2009.10.18 World languages by family and speaking population

During my linguistics fugue this weekend, which also produced the subject of the previous post, I happened upon the Wikipedia list of world languages by speaking population. It occurred to me that it would be interesting to cross-reference speaking population with language families. So, in an Excel spreadsheet, I tabulated the population and family of all 272 languages listed in the Wikipedia article as having one million or more native speakers. I then wanted to display this information graphically. I separated out only the seven most common families, aggregating all the languages not in those families into a single miscellaneous category. First I decided to represent each language as a horizontal bar, with each pixel of width representing one million people, and to place the bars end-to-end in decreasing order of speaking population. I summed up the rounded population figures of all the languages (some of which I had to synthesize from the different statistics given in the article) and the sum, by a funny coincidence, turned out to be 5280 (i.e., the number of feet in a mile), representing 5.28 billion people. I represented each language family by a color:

I used two shades of each of these colors in order to differentiate between adjacent members of the same language family. Because 5280 pixels is annoyingly wide and 5280 is conveniently divisible by many factors, I divided it into twelve equal lengths of 440 pixels and stacked those up, so that each segment represents 440 million people. Here's the result:

representation of languages colored by family, with width representing speaking population
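I drew the chart by hand, but the wrapping layout described above is mechanical enough to sketch in code. The following is a rough Python sketch rather than what I actually did: the function name `wrap_bars` and the tuple layout are made up for illustration, and the only figure taken from the data is Mandarin's 845 million (the other two numbers are placeholders chosen to fill exactly three rows).

```python
ROW_WIDTH = 440  # pixels per row; one pixel per million speakers


def wrap_bars(populations, row_width=ROW_WIDTH):
    """Lay bars end-to-end and wrap them into rows of fixed width.

    populations: speaker counts in millions, already sorted largest first.
    Returns a list of (language_index, row, x_start, x_end) segments;
    a bar that crosses a row boundary is split into several segments.
    """
    segments = []
    offset = 0  # running position along the unwrapped strip
    for index, width in enumerate(populations):
        start, end = offset, offset + width
        while start < end:
            row, x = divmod(start, row_width)
            # Draw as much of the bar as fits in the current row.
            run = min(end - start, row_width - x)
            segments.append((index, row, x, x + run))
            start += run
        offset = end
    return segments


# 845 is Mandarin's figure; 300 and 175 are placeholder values.
example = wrap_bars([845, 300, 175])
```

With these inputs the first bar fills all of row 0 and the first 405 pixels of row 1, which matches Mandarin taking up most of the first two bars in the chart.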

The main point of this organizational scheme is to give a quickly understandable idea of which language families the more and less widely spoken languages tend to belong to, and of the total speaking population of those families. The first language, taking up most of the first two bars at 845 million speakers, is, of course, Mandarin. The next three are Indo-European: Spanish, English, and Hindi/Urdu; they are followed by Arabic, and then by more Indo-European languages, namely Bengali, Portuguese, and Russian. This concentration of Indo-European languages at the top is the main reason why Indo-European is the most populous language family. Japanese, the first miscellaneous language (since it's almost an isolate), is next, followed by German, and then by Javanese (the largest Austronesian language), Wu (the Chinese language spoken in a large region encompassing Shanghai), and Korean. (For more specifics, refer back to the Wikipedia article.) An interesting thing to see further down is that no Niger-Congo language is particularly large, but that there are a load of them with about 1–20 million speakers. That contrasts with a family like Sino-Tibetan, which has by far the most populous language in the world and a number of other large languages, but not all that many members overall in this set.

(Note that I made all these drawings by hand in Photoshop. I probably could have automated their production fully in Processing or something else with similar data visualization capabilities, but that would have taken quite a while to figure out.) After making this chart, I thought that a histogram would be another interesting way to see the data. It was relatively quick to make one from the previous chart:

histogram of the languages, colored by family, with length representing population

Here, each horizontal line represents a language, again at one pixel per million people, and there's no wraparound, unlike the earlier representation. You can see all the more clearly how huge Mandarin is compared to other languages. Because the one-pixel lines are so thin, though, it's somewhat difficult to get much else out of this beyond an impression of the overall distribution of language size and some idea of which language families are concentrated where on the population spectrum. For instance, you can see Indo-European (green) dominant at the top, Austronesian (sky blue) common in the middle, and Niger-Congo (red) prevalent in the long tail. I tried one more representation, in which I sorted the families out from each other, making a histogram for each and stacking the histograms on top of each other. I also labeled the families, just for you:

histograms of language families by population

This satisfyingly clarifies a number of new pieces of information. Now the relative number of members (with at least a million speakers) of each language family is revealed by the height of each histogram. Indo-European and Niger-Congo have the most members by far; Sino-Tibetan, Afro-Asiatic, Altaic, and especially Dravidian are relatively tiny by this measure. Also, the area of each histogram represents its language family's total speaking population for the languages covered; it's pretty obvious that Indo-European is the largest, although not nearly as obvious that Sino-Tibetan is second, which it is. It's also interesting to look closely at each histogram's distribution. Some of the families, such as Sino-Tibetan, Afro-Asiatic, and Altaic, each have one member that's far larger than any of the others (Mandarin, Arabic, and Turkish, respectively), whereas others, such as Indo-European and Dravidian, have a more even distribution. There's probably a host of other conclusions that can be drawn from this, and it would be interesting to extend it further to languages spoken by fewer than a million people. Especially in that case, it would also be useful to see speaking population on a logarithmic scale. This is a good start, though.
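For anyone who would rather not do this by hand, the per-family histograms boil down to a group-and-sort. Here is a minimal Python sketch, assuming the spreadsheet rows are available as (name, family, millions of speakers) tuples. The function name `family_histograms` and the sample rows are placeholders for illustration, not real data from the article.

```python
from collections import defaultdict


def family_histograms(languages):
    """Group languages by family and sort each group largest first.

    languages: iterable of (name, family, millions_of_speakers).
    Returns {family: {"members": [(name, millions), ...],
                      "height": number of languages in the family,
                      "area": total speakers in millions}}.
    """
    groups = defaultdict(list)
    for name, family, millions in languages:
        groups[family].append((name, millions))

    result = {}
    for family, members in groups.items():
        # Each histogram lists its languages in decreasing population order.
        members.sort(key=lambda member: member[1], reverse=True)
        result[family] = {
            "members": members,
            "height": len(members),  # one line per language
            "area": sum(millions for _, millions in members),
        }
    return result


# Placeholder rows, not figures from the Wikipedia list.
sample = [
    ("lang_a", "Family X", 120),
    ("lang_b", "Family Y", 40),
    ("lang_c", "Family X", 300),
    ("lang_d", "Family Y", 45),
    ("lang_e", "Family X", 7),
]
histograms = family_histograms(sample)
```

Here "height" is the number of one-pixel lines in a family's histogram and "area" is its total speaking population, the two quantities read off the chart above.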