Proposal:Data.wikimedia.org

Status

The status of this proposal is:
Request for Discussion / Sign-Ups

Every proposal should be tied to one of the strategic priorities below.


  1. Achieve continued growth in readership
  2. Focus on quality content
  3. Increase Participation
  4. Stabilize and improve the infrastructure
  5. Encourage Innovation



This is a featured proposal.
See also: Proposal:A central repository of all language independent data

Summary

This is a long-term proposal. We collect immense amounts of data without making much effort to make it processable. A natural extension of what we do would be a project data.wikimedia.org, providing a host for data and/or information about external data, oriented toward processability by computer programs. This would work hand in hand with Wikipedia projects such as de:Wikipedia:WikiProjekt_Vorlagenauswertung/en [1]. Steering by Wikimedia is needed because we have a hypertext environment that is well adapted to putting any amount of human-readable text into context, but not to hosting data.

As this is quite a big project, the proposal at this stage is for Wikimedia to commit itself to this development and to host an open discussion, later finalized by a feasibility study.

Nevertheless, we will state some ideas about where to go, meant as a seed crystal for the development. Also, it is already possible to approach existing projects of the Wikimedia family that collect data in order to assess their needs.

Motivation/Existing projects

External up-to-date data displayed in Wikipedia articles
-> Proposal:Data-driven content

This section is concerned with automatically generated data incorporated in Wikipedia articles. A perfect example is the bot-generated collection of data at de:Vorlage:Wechselkursdaten, used by the template de:Vorlage:Wechselkurs to display the current exchange rate of certain currencies in the currency's article. Basically, this (mis-)uses existing means to implement a query of an external "variable":

{{Wechselkursdaten|USD}} 1.4294

A data.wikimedia project worthy of its name would provide means to accomplish such tasks without resorting to a local bot-and-template solution. An overview of metadata templates in the German Wikipedia can be found in this category: de:Kategorie:Vorlage:Metadaten.

In this example, the ECB exchange rates at [1] are given in en:SDMX-ML, an XML-based format designed for time series. It is of interest to implement, as it focuses on the transmission of metadata, multi-dimensional arrays, time series, and administrative data.
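
To make the idea of querying an external "variable" more concrete, here is a minimal Python sketch that parses a small rate feed and looks up one currency, roughly what {{Wechselkursdaten|USD}} does today via a bot. The XML fragment, its element and attribute names, and the function names are assumptions made for illustration; the fragment is not actual SDMX-ML, and the only value in it is the rate quoted in the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified rate feed. Real SDMX-ML is much richer;
# the element and attribute names below are illustrative assumptions.
SAMPLE_FEED = """
<rates base="EUR">
  <rate currency="USD" value="1.4294"/>
</rates>
"""


def parse_rates(xml_text):
    """Return a dict mapping currency codes to exchange rates."""
    root = ET.fromstring(xml_text.strip())
    return {
        rate.get("currency"): float(rate.get("value"))
        for rate in root.iter("rate")
    }


def lookup_rate(rates, currency):
    """Look up one currency code; return None if the feed has no entry."""
    return rates.get(currency.upper())


rates = parse_rates(SAMPLE_FEED)
print(lookup_rate(rates, "USD"))
```

In such a design the wiki would only need to know where the feed lives and how to address a value in it, instead of having a bot copy values into a template.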

Serchilo
-> in de.wp, Search-Website (engl.), Main wiki page

This is not a Wikimedia project, but it is of interest because it involves multiple users building a (MediaWiki-powered) wiki under Attribution-Share Alike 3.0 with entries of this form [2]:

<command>
    keyword  = dbt
    title    = DB-Auskunft Textversion
    language = de
    url      = http://reiseauskunft.bahn.de/bin/query.exe/dl?S={%Start|iso-8859-1}&Z={%Ziel|iso-8859-1}&start=1
</command> 

These entries generate automated deep links to relevant information from user queries, e.g.

db hamburg, berlin is translated into http://reiseauskunft.bahn.de/bin/query.exe/dl?S=Hamburg&Z=Berlin&start=1

and accesses the DB train schedule for Hamburg-Berlin.
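
The translation step can be sketched in a few lines of Python. This is not Serchilo's actual implementation: it assumes that a placeholder of the form {%Name|encoding} is filled with the query arguments in order and percent-encoded in the named character encoding, and the function names are hypothetical. Unlike the example above, this sketch does not change the capitalization of the arguments.

```python
import re
from urllib.parse import quote

# URL template copied from the command entry above.
URL_TEMPLATE = (
    "http://reiseauskunft.bahn.de/bin/query.exe/dl"
    "?S={%Start|iso-8859-1}&Z={%Ziel|iso-8859-1}&start=1"
)

# Assumed placeholder syntax: {%Name|encoding}
PLACEHOLDER = re.compile(r"\{%(\w+)\|([\w-]+)\}")


def parse_query(query):
    """Split 'keyword arg1, arg2' into the keyword and a list of arguments."""
    keyword, _, rest = query.partition(" ")
    arguments = [part.strip() for part in rest.split(",") if part.strip()]
    return keyword, arguments


def expand(template, arguments):
    """Fill the placeholders in order with percent-encoded arguments."""
    remaining = iter(arguments)

    def substitute(match):
        encoding = match.group(2)
        # Missing arguments are replaced by an empty string.
        return quote(next(remaining, ""), encoding=encoding)

    return PLACEHOLDER.sub(substitute, template)


keyword, arguments = parse_query("db hamburg, berlin")
print(expand(URL_TEMPLATE, arguments))
```

The wiki entry thus carries everything needed to build the request; what it lacks is any description of the response.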

From a higher perspective, this project covers only part of the story:

user -> manual query -> processing by server and wiki -> automated query -> external page
(but) processing by user <- automated response <- external page

The automated response goes straight to the user and is not processed any further. Serchilo only holds information on how to access certain pages and leaves it to the user to retrieve and process the information.

Proposal

Main proposal as formulated in the summary: to host an open discussion about scope, feasibility and means of this new project; to acknowledge the importance of structured data and to include this topic in strategic planning.

But to start with something, the following is a rough sketch of what from my perspective would be useful to implement:

The basic idea is to enable human editors to provide processable metadata about freely accessible data or data series hosted elsewhere. The most difficult part would be to choose a design that is neither too specific nor too general. Being too general in the covered data formats implies that no internal processing of the actual data could take place, and the project would become merely an open directory of data. Being too specific deprives the project of possible uses.

If we wish to look inside the data, we first need to choose an adequate metaphor, so maybe we are talking about

  • static or dynamic files containing arrays of records in certain data types with array dimensions being subsets of entities like countries, time or unspecified

Next, it is important not only to report data ("at http://www.example.org/data/sometable.xls is a file containing some information about word frequency in the Romanian language"), but to provide a combination of:

  1. information on exactly what is in the file, or exactly which data is provided
  2. a possibility to further process this information and, in the case of static data, a possibility to host the data locally
  3. integration with other Wikimedia projects

The following examples illustrate these three points:

one - is presumably solved by choosing or defining an appropriate XML standard;
two - we would assess whether there is hope of automatically generating diagrams of data;
three - usability requires that data (arrays, records, diagrams etc.) can be accessed inside Wikipedia articles.
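
As a seed for discussing point one, the following Python sketch shows what a machine-readable description of an external file could contain, following the metaphor of arrays of records with dimensions. All class and field names are hypothetical, the record type given for the example file is an assumption, and JSON is used only for brevity where the text above suggests an XML standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Dimension:
    """One array dimension, e.g. 'country', 'time' or 'unspecified'."""
    name: str
    values: list = field(default_factory=list)


@dataclass
class DatasetDescription:
    """Hypothetical metadata record about a file hosted elsewhere."""
    url: str
    title: str
    is_static: bool       # static files could also be hosted locally
    record_type: str      # data type of the records in the array
    dimensions: list = field(default_factory=list)

    def to_json(self):
        """Serialize the description, including nested dimensions."""
        return json.dumps(asdict(self), indent=2)


# Description of the example file mentioned above; the record type and
# the single unspecified dimension are assumptions for illustration.
description = DatasetDescription(
    url="http://www.example.org/data/sometable.xls",
    title="Word frequency in the Romanian language",
    is_static=True,
    record_type="integer",
    dimensions=[Dimension(name="unspecified")],
)
print(description.to_json())
```

A record like this is what an editor would maintain by hand; points two and three would then build on it.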

Key questions

  • Who are the potential addressees (contributors/users) of this project? This is important because, from the beginning, the project would be directed at users with specialized knowledge.
  • What kind of data to cover?
  • What language/specification for meta-data?

Potential Costs

Potential costs are low until actual implementation. Implementation costs could then become a limiting factor, depending on the chosen path.

References

  1. This project extracts all templates from the database dump. It intends to analyse the variable values contained in the templates and represent them in new ways, apply filters and do other useful stuff.
  2. http://wiki.serchilo.net/index.php?title=Deutsche_Bahn_Fahrplanauskunft&oldid=10445

Related proposals
Proposal:Weather tracking
Proposal:Data-driven content

Community Discussion

Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Data.wikimedia.org.

Want to work on this proposal?

  1. Cherkash 01:45, 2 November 2010 (UTC)