Task force/Recommendations/Local Language 3


Outline

Question/Problem

For local language projects to grow, it is important that editors can interact with the MediaWiki software in their own language. It is also important that the software supports the characters of the local languages and that it supports right-to-left scripts.

Strategy

A three-stage process to get the localization done for all languages with one million or more native speakers is:

  1. Run a campaign to attract more volunteer translators to translatewiki.net.
  2. When the number of translated messages starts to plateau, put money into bounty rallies and pay-per-message solutions. (Estimated cost: $100,000 to get all messages translated with these methods alone)
  3. Finally, if neither of these approaches is sufficient to translate the remaining messages, consider whether it is worth hiring professional translators to get the last messages translated. (Estimated cost: $1 million to get all messages translated with this method alone)

To solve character set issues, it would be a good idea to cooperate with established open source communities that are trying to solve the same issues for other systems. More documents about African languages can be found at http://www.africanlocalisation.net/documents; the following document seems especially interesting: Characters needed for African orthographies in Latin writing system - http://www.africanlocalisation.net/content/characters-needed-African-orthographies-Latin-writing-system

The Task Force has not had time to research what the actual issues with right-to-left support are, but it is important to give right-to-left languages the same support as left-to-right languages.

Important note about paying for translation: Even though paying for translation might be an effective way of getting it done, it is important to realize that volunteers are likely to stop translating if others get paid for the same work. Hiring professional translators in particular is likely to discourage volunteer translators. Bounty rallies and pay-per-message methods are less likely to discourage volunteers because everyone has a chance to earn a share, but it is still important to ensure that everyone has the same chance at that share. A monetary reward could also decrease intrinsic motivation, as explained in the Wikipedia article on the overjustification effect.

Important additional note: One more point that has been raised is the need to develop, or make available, internationalization tools for media content. The Task Force has not had time to research what these issues are or how they can be solved, but one issue that has been mentioned is that SVG->PNG conversion gives strange results for some characters. Internationalization of the software is, of course, as important as its localization.

Assertion: The MediaWiki software could be fully localized for all languages with one million or more native speakers for $1 million, $100,000, or less.

The statistics on localisation of the MediaWiki system messages list, as of 25 December 2009, 323 different localisations. MediaWiki defines 362 localisations. Some localisations are excluded for various reasons; the most common is that the localisation definition was created for convenience or is still present for backward compatibility.

Among the languages, the amount of localisation varies from 0% to 100%. The number of MediaWiki core non-optional system messages is 2,369, and the number of messages for extensions used by Wikimedia is 2,727 (as of 2009/12/25). The total number of messages to be translated is (2,369 + 2,727) * 323 = 1,646,008. The average localisation percentage of MediaWiki core is 46.87%. For MediaWiki extensions used by Wikimedia it is 20.61%[1]. This means that about 1,105,757 messages, or roughly 67% of the total, have not been translated.
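The arithmetic above can be checked with a short sketch. The constant and function names are illustrative and not part of the original statistics; only the message counts and percentages are taken from the text. Because it uses the rounded percentages, the result differs from the quoted 1,105,757 by a few dozen messages.

```python
# Figures quoted in the text (as of 2009/12/25).
CORE_MESSAGES = 2369          # MediaWiki core non-optional system messages
EXTENSION_MESSAGES = 2727     # messages for extensions used by Wikimedia
LOCALISATIONS = 323           # localisations included in the statistics
CORE_TRANSLATED = 0.4687      # average localisation share, core
EXTENSION_TRANSLATED = 0.2061 # average localisation share, extensions


def total_messages():
    """Total number of message translations needed across all localisations."""
    return (CORE_MESSAGES + EXTENSION_MESSAGES) * LOCALISATIONS


def untranslated_messages():
    """Estimated number of messages still untranslated."""
    core = CORE_MESSAGES * LOCALISATIONS * (1 - CORE_TRANSLATED)
    extensions = EXTENSION_MESSAGES * LOCALISATIONS * (1 - EXTENSION_TRANSLATED)
    return core + extensions


def untranslated_share():
    """Untranslated messages as a fraction of the total."""
    return untranslated_messages() / total_messages()


if __name__ == "__main__":
    print(f"total messages:       {total_messages():,}")
    print(f"untranslated (est.):  {untranslated_messages():,.0f}")
    print(f"untranslated share:   {untranslated_share():.1%}")
```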

Siebrand at translatewiki.net has estimated that a professional translator could translate 100 messages per hour. Translating the approximately 1,100,000 messages that are currently untranslated would therefore take about 11,000 hours. On this list there are 275 languages with more than one million native speakers. Assuming that their percentage of untranslated messages is similar to that of the full list of 323 languages, this means about 940,000 untranslated system messages. (The number of untranslated messages in these languages is probably lower, because it is the 275 largest languages that have been selected, and these are likely to have more finished translations.) At the same translation speed, this means about 9,400 translation hours. There are several ways to get these messages translated.
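A minimal sketch of this scaling step, assuming (as the text does) that untranslated messages are spread evenly across localisations. The function names are hypothetical; the input figures are the ones quoted above.

```python
UNTRANSLATED_ALL = 1_105_757   # untranslated messages across all 323 localisations
LOCALISATIONS_ALL = 323
LOCALISATIONS_1M = 275         # languages with more than one million native speakers
MESSAGES_PER_HOUR = 100        # Siebrand's estimate for a professional translator


def scaled_untranslated():
    """Scale the untranslated count down to the 275 larger languages."""
    return UNTRANSLATED_ALL * LOCALISATIONS_1M / LOCALISATIONS_ALL


def translation_hours(messages):
    """Hours needed to translate the given number of messages."""
    return messages / MESSAGES_PER_HOUR


if __name__ == "__main__":
    messages = scaled_untranslated()
    print(f"untranslated in 275 languages: {messages:,.0f}")
    print(f"hours at 100 messages/hour:    {translation_hours(messages):,.0f}")
```

The text rounds the scaled figure to 940,000 messages and 9,400 hours.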

  • Translators could be hired. An estimate from Siebrand is that hiring translators would cost $85/hour plus 20% in overhead. This would mean that all the messages could be translated for 9,400 hours * $85/hour * 1.2 = $958,800. This is about 13% of the goal of this year's fundraiser.
  • At translatewiki.net, translation rallies have been arranged in which translators were awarded a share of €1000 if they translated a certain minimum number (500?) of system messages. This approach has cost about $0.08/message. If all 940,000 messages could be translated by this method with the same efficiency, it would cost about $75,000. Comparing once again to this year's fundraiser, that is 1% of this year's goal.
  • A third method could be to let translators register with a PayPal account and pay them $0.10 per message they translate. At the translation speed of 100 messages an hour that Siebrand has estimated, translators could earn about $10/hour. For translators in wealthy countries this would not be very high pay and could be seen more as encouragement for volunteers. In some less wealthy countries, probably including many whose languages are currently under-localized, this would however probably be quite a high hourly wage. For well-localized languages, where volunteering is more likely to happen, the money would therefore be more of an encouragement to do volunteer work, while for less localized languages, where voluntary work is less likely to happen, it would be more of an actual wage. It is however an open problem how the quality of the translations can be assured with this method. To cover transaction fees and other overhead, there could be a minimum number of translations that must be completed before any payment is made, with the first translations left unpaid to cover such expenses. The cost of this method would be $94,000, or about 1.3% of this year's fundraiser goal. The exact amount of $0.10 could of course be changed, which would change the total cost.
  • A fourth method is to run a massive campaign on all Wikimedia projects that highlights the need for localization. Siebrand could work together with WMF to arrange such a campaign. One way to run this campaign could be to use the fundraiser banner space to promote localization. The local chapters could also promote localization. The cost of this method would be very low.
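The cost estimates for the three paid methods above can be compared side by side with a small sketch. The function and parameter names are hypothetical; the rates, overhead, and message counts are those given in the list.

```python
UNTRANSLATED_MESSAGES = 940_000   # rounded estimate for the 275 languages
TRANSLATION_HOURS = 9_400         # at 100 messages per hour


def hired_translator_cost(hours, rate_per_hour=85.0, overhead=0.20):
    """Cost of hiring professional translators, including overhead."""
    return hours * rate_per_hour * (1 + overhead)


def per_message_cost(messages, price_per_message):
    """Cost of any scheme that effectively pays a fixed price per message."""
    return messages * price_per_message


if __name__ == "__main__":
    estimates = {
        "hired translators": hired_translator_cost(TRANSLATION_HOURS),
        "translation rallies ($0.08/message)": per_message_cost(UNTRANSLATED_MESSAGES, 0.08),
        "pay per message ($0.10/message)": per_message_cost(UNTRANSLATED_MESSAGES, 0.10),
    }
    for method, cost in estimates.items():
        print(f"{method}: ${cost:,.0f}")
```

Changing `price_per_message` shows directly how the total for the third method moves with the per-message amount. The sketch does not model the unpaid minimum threshold or transaction fees mentioned above.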

Assertion: There are many languages whose character sets differ from the basic Latin characters.

Latin characters are well supported by MediaWiki, but the character set of every language should be supported. For example, according to http://www.africanlocalisation.net/sites/default/files/AtypI08%20African%20fonts.pdf, African languages largely use Latin alphabets, but with a variety of character set extensions. Other scripts used for African languages include Arabic script, Ethiopic, Tifinagh, Nko, Vai, Kikakui, Bamum and Mandombe. There are probably even more character sets that need to be supported if one considers more than just African languages.

Adding together the number of speakers of the languages given as examples in the document linked above that use extended Latin alphabets gives about 140 million speakers (slides 12-13), which shows that coverage of different character sets is important.
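As a rough illustration of what "extended Latin" and non-Latin scripts mean in practice, the sketch below uses Python's standard unicodedata module to report which scripts a piece of text draws on. Taking the first word of a character's Unicode name is only a heuristic, not the official Unicode Script property, and the sample characters are chosen purely for illustration; none of this comes from the cited documents.

```python
import unicodedata


def script_prefix(ch):
    """Return the first word of the character's Unicode name, e.g. 'LATIN'.

    Heuristic only: this is not the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_used(text):
    """Set of script prefixes for all letters in the text."""
    return {script_prefix(ch) for ch in text if ch.isalpha()}


def letters_beyond_ascii(text):
    """Letters in the text that fall outside the basic ASCII range."""
    return sorted({ch for ch in text if ch.isalpha() and ord(ch) > 127})


if __name__ == "__main__":
    samples = {
        "extended Latin": "\u014b\u025b\u0254",  # eng, open e, open o
        "Ethiopic": "\u12a0",
        "Tifinagh": "\u2d63",
        "Arabic": "\u0628",
    }
    for label, text in samples.items():
        print(label, scripts_used(text), letters_beyond_ascii(text))
```

The first sample shows letters that are Latin by name but still outside basic ASCII, which is the case the extended Latin alphabets above fall into.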

Assertion: To solve character set issues it would be a good idea to cooperate with already established open source communities that try to solve the same issues for other systems.

ANLoc (http://africanlocalization.net) is one such community.

Some open source font packages that cover African languages are:

  • Charis SIL and Doulos SIL
  • Gentium
  • DejaVu fonts
  • Liberation fonts (in progress)
  • Droid fonts (in progress)