Task force/Recommendations/Offline 4

Outline for Recommendation #4 - article selection

Question

How can we produce a variety of different page/article selections, to meet the varying needs of offline content users?

Strategy

A range of tools should be available to allow selections to be made for offline releases:

  1. The community is needed to flag articles for Importance and Quality.
  2. Hit statistics and other metadata are required to gauge article popularity and significance.
  3. Bots are needed to collect this information into tables.
  4. Article history, in conjunction with Flagged Revisions and WikiTrust, is needed to pick unvandalised versions.
  5. An online selection tool, usable by publishers and other users, is needed. It would combine the information above with categories and other metadata to emit custom dumps, probably as XML.
  6. Such dumps will include templates and, optionally, media, so that they can be converted to a final output format such as HTML or OpenZIM.

Assertion: We need automatic tools to perform article selection

Fact: We cannot always use the whole project due to size constraints

Wiktionary and other projects are small enough to 'swallow whole' (at least in electronic releases) but the Wikipedias need a lot of trimming, especially if images are included. Book releases present an even greater challenge.

Fact: We can use quality, importance and popularity measures via the WP:1.0 project to aid selection

See en:Version_1.0_Editorial_Team and en:Version_1.0_Editorial_Team/Assessment and the main index at en:Version_1.0_Editorial_Team/Index. Also see fr:Projet:Wikipédia_1.0/Index and hu:Wikipédia:Cikkértékelési_műhely/Index.

On these projects, bots collect and tabulate metadata. WikiProject teams still need to tag articles on a Quality scale, and preferably on an Importance scale as well. However, because the assessment work is decentralized, performed by subject experts, and supported by bots, this approach has proved scalable: WikiProjects have now manually assessed more than 2 million articles on en, 370,000 on fr, and over 50,000 on hu.
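As a rough illustration of how tabulated quality, importance and popularity data could drive an automatic selection under a size constraint, here is a minimal Python sketch. The class names (FA, A, GA, B, C, Start, Stub; Top, High, Mid, Low) follow the en assessment scales, but the point values, the weighting, the Article record and the greedy size budget are all assumptions made for this example. They are not the formula used by any WP:1.0 bot.

```python
import math
from dataclasses import dataclass

# Illustrative point values only (assumed for this sketch).
QUALITY_POINTS = {"FA": 6, "A": 5, "GA": 4, "B": 3, "C": 2, "Start": 1, "Stub": 0}
IMPORTANCE_POINTS = {"Top": 3, "High": 2, "Mid": 1, "Low": 0}


@dataclass(frozen=True)
class Article:
    """Hypothetical record of the metadata a bot might tabulate."""
    title: str
    quality: str       # key of QUALITY_POINTS
    importance: str    # key of IMPORTANCE_POINTS
    monthly_hits: int  # page views, used as a popularity measure
    size_bytes: int    # size of the article in the release


def score(article: Article) -> float:
    """Combine quality, importance and popularity into one number."""
    # Log scale so that very popular pages do not swamp the other measures.
    popularity = math.log10(article.monthly_hits + 1)
    return (QUALITY_POINTS[article.quality]
            + 2 * IMPORTANCE_POINTS[article.importance]
            + popularity)


def select(articles, budget_bytes):
    """Greedily take the highest-scoring articles that still fit the budget."""
    chosen = []
    used = 0
    for article in sorted(articles, key=score, reverse=True):
        if used + article.size_bytes <= budget_bytes:
            chosen.append(article)
            used += article.size_bytes
    return chosen
```

A publisher with a different target (for example a book rather than a DVD) would only need to change the budget or the weights, which is the kind of flexibility the selection tool is meant to offer.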

Once the selection has been made, another tool needs to use category information and other metadata to generate indexes.

Assertion: We need automatic tools to perform 'best-version' selection

Sub assertion: Vandalised versions of articles are a problem

Fact: Vandalism is a problem on Wikipedia

In the original selection for en:WP Version 0.7, we selected only versions from account holders, yet we still found approximately 200 examples of significant vandalism in our offline collection. For example, the article on a popular black comedian had the following opening statement "Suc* my a** kids im a ni**er and my real name is charles reed pulk" (asterisks added by myself).

Fact: A vandalized version of an article in an offline release cannot be corrected

As a result, we might send a page full of obscenities to 100,000 schools.

Fact: Manual article checking is difficult and slow

For the "Wikipedia for schools" collection, a whitelist greatly helped in version selection, but some examples of "deep vandalism" still crept in. Extensive checking and rewriting by volunteers was used, since this collection was specifically aimed at children. A similar whitelist approach was used with the German Wikipedia 1.0 releases. With the en:Wikipedia 1.0 releases, where a whitelist was not used, a search was performed for "bad words" and vandalized versions were then identified and corrected manually. This approach is extremely tedious and labour-intensive, and delayed the release by six months.
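The "bad words" search described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the script actually used for the en:Wikipedia 1.0 releases; the function names and the placeholder word list are hypothetical. Note that such a search only produces candidates: each hit still has to be reviewed by hand, which is why the approach is slow.

```python
import re


def build_pattern(bad_words):
    """Compile one case-insensitive pattern matching any listed word as a whole word."""
    escaped = (re.escape(word) for word in bad_words)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def flag_articles(texts, bad_words):
    """Return {title: sorted list of matched words} for articles needing manual review.

    texts maps an article title to the text of the selected version.
    """
    if not bad_words:
        return {}
    pattern = build_pattern(bad_words)
    flagged = {}
    for title, text in texts.items():
        hits = sorted({match.group(0).lower() for match in pattern.finditer(text)})
        if hits:
            flagged[title] = hits
    return flagged
```

A word list also produces false positives (legitimate uses of a listed word) and misses vandalism that uses no listed word, such as the "deep vandalism" mentioned above.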

Fact: There are projects like WikiTrust and Flagged Revisions that should be able to automate version selection in the future

See http://wikitrust.soe.ucsc.edu/

User:Walkerma has had detailed discussions with Luca de Alfaro from the WikiTrust project, and automated version selection is very likely to be possible. This approach would give every article version a trust value, based on the sum of the contributions in it, which would allow us to avoid picking vandalized versions. On January 10th, 2010, Prof. de Alfaro reconfirmed that this is very likely to be possible, and an article version dump was completed on that date.
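To make the idea concrete, the following Python sketch picks the newest version of an article whose trust value clears a threshold. It does not use the WikiTrust software or its data format; the Revision record, the 0.0 to 1.0 trust scale, the default threshold and the fallback rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Revision:
    """Hypothetical record for one article version."""
    rev_id: int
    timestamp: str  # ISO 8601 in UTC, so string order matches time order
    trust: float    # assumed per-version trust value between 0.0 and 1.0


def pick_revision(revisions: Sequence[Revision],
                  min_trust: float = 0.8) -> Optional[Revision]:
    """Return the newest revision with trust >= min_trust.

    If no revision clears the threshold, fall back to the most trusted one
    (ties go to the newer revision). Returns None for an empty history.
    """
    if not revisions:
        return None
    newest_first = sorted(revisions, key=lambda r: r.timestamp, reverse=True)
    for revision in newest_first:
        if revision.trust >= min_trust:
            return revision
    return max(newest_first, key=lambda r: r.trust)
```

Because the choice is a simple scan over the article history, it can be repeated for every article in a selection whenever an update is due, with no manual checking in the loop.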

An alternative approach would be to use Flagged Revisions. In the German Wikipedia, and others where the Flagged Revisions extension is well established, this may prove to be a popular method for version selection.

Fact: Rapid version selection is needed in order to provide regular selection updates

See below. Clearly, we cannot produce monthly updates if each version selection requires six months of manual checking.

Assertion: Releases of offline selections will need to be carefully structured

This assumes that we begin to produce a variety of article/page selections from Wikipedia.

Sub assertion: We should use a clear system of identifiers for each release and update

For example, if we make a selection of "Top 1000 biographies" for 2011, we could use simple, clear release numbers, such as "Top 1000 Biographies, Version 2011.01" for January 2011, with perhaps monthly updates that read "2011.02" for February, etc. The details can be discussed, but a standard numbering scheme should be used, preferably across all output formats. One number (here, 2011) should indicate the selection of articles, which in this case would perhaps be updated once a year; the other number (here, 01) should indicate the selection of article versions, presumably updated once a month. Note that automated selection of both articles and article versions is a prerequisite for this type of regular, structured series of releases.
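A small Python sketch shows how such identifiers could be generated, parsed and compared in a uniform way. It assumes the "YYYY.MM" reading of the example above (first number for the article selection, second for the monthly version update). The ReleaseId class, its method names and the rule that the update after "2011.12" is "2012.01" are hypothetical choices for illustration, since the details are still open for discussion.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ReleaseId:
    """Hypothetical identifier: article selection year plus monthly version update."""
    selection_year: int
    version_month: int

    def __post_init__(self):
        if not 1 <= self.version_month <= 12:
            raise ValueError(f"month out of range: {self.version_month}")

    def __str__(self):
        # Zero-pad the month so identifiers sort correctly as plain text too.
        return f"{self.selection_year}.{self.version_month:02d}"

    @classmethod
    def parse(cls, text):
        """Parse a string such as '2011.02'."""
        match = re.fullmatch(r"(\d{4})\.(\d{2})", text)
        if match is None:
            raise ValueError(f"not a release identifier: {text!r}")
        return cls(int(match.group(1)), int(match.group(2)))

    def next_monthly(self):
        """Identifier of the following monthly update (assumed to roll over in January)."""
        if self.version_month == 12:
            return ReleaseId(self.selection_year + 1, 1)
        return ReleaseId(self.selection_year, self.version_month + 1)


def release_title(selection_name, release_id):
    """Format the full, human-readable release name."""
    return f"{selection_name}, Version {release_id}"
```

For instance, release_title("Top 1000 Biographies", ReleaseId(2011, 1)) gives "Top 1000 Biographies, Version 2011.01", matching the example above.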

Sub assertion: We should offer frequent version updates

If WikiTrust or Flagged Revisions allows us to produce lists of low-vandalism versions rapidly, we should use this to update the article version selection often.

Fact: Even for offline use, end-users still want to have a current selection

See this example.

Fact: Content can become stale

Elections, wars, terrorist attacks and "acts of God" all happen with great regularity, and version updates should help keep the content "fresh".