Proposal:Structured Data

A feature request or bug related to this proposal has been submitted to Bugzilla under ID 30345.

See Category:Proposals with Bugzilla submissions for all submitted bugs.

Status (see valid statuses)

The status of this proposal is:
Request for Discussion / Sign-Ups

Every proposal should be tied to one of the strategic priorities below.

Edit this page to help identify the priorities related to this proposal!


  1. Achieve continued growth in readership
  2. Focus on quality content
  3. Increase Participation
  4. Stabilize and improve the infrastructure
  5. Encourage Innovation
See also: meta:Wikidata (2)

Summary

Structure the data in Wikipedia entries so that the entries can be found by query[1], for example "all European authors between 1910 and 1925" or "all A/V products released by Sony from 2001 to 2003".


Semantics is the key to Web 3.0. Do we want it? Do we want it inside the articles? A lot of software already scans the web for information and tries to understand it in order to extract knowledge. We can do better: we can put the semantics into the information itself.

Proposal

This is not a technical proposal. Extensions such as Semantic MediaWiki already exist and are quite mature. The question is: do we want semantics in Wikipedia?

Add standard "field=value" data to all articles. Field names and value formats must both be standardized so that the data can fit into an SQL database. For example:

Name=Johann Schiller
FullName=Johann Christoph Friedrich von Schiller
Birthdate=1759-11-10
Deathdate=1805-05-09          Note: dates in ISO 8601 format (YYYY-MM-DD)
Sex=M
Type=Person
Profession=Poet|Dramatist
Nationality=DE                Note: ISO 3166-1 alpha-2 country code
PlaceOfBirth=Marbach, DE

etc.
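
As a rough illustration of how such "field=value" records could fit into an SQL database, here is a minimal sketch using Python's built-in sqlite3 module. The table name `facts`, the helper functions, and the article title "Friedrich Schiller" are assumptions made for this example only; the proposal does not prescribe any schema.

```python
import sqlite3

RECORD = """\
Name=Johann Schiller
FullName=Johann Christoph Friedrich von Schiller
Birthdate=1759-11-10
Deathdate=1805-05-09
Sex=M
Type=Person
Profession=Poet|Dramatist
Nationality=DE
PlaceOfBirth=Marbach, DE
"""


def parse_record(text):
    """Parse 'Field=value' lines into a dict; '|' separates multiple values."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        field, value = line.split("=", 1)
        record[field.strip()] = [v.strip() for v in value.split("|")]
    return record


def load(conn, article, record):
    """Store one row per (article, field, value) triple (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (article TEXT, field TEXT, value TEXT)"
    )
    rows = [(article, f, v) for f, values in record.items() for v in values]
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)


def poets_born_between(conn, first_year, last_year):
    """Return articles about poets born in the given year range (inclusive)."""
    sql = """
        SELECT b.article FROM facts AS b
        JOIN facts AS p ON p.article = b.article
        WHERE b.field = 'Birthdate'
          AND p.field = 'Profession' AND p.value = 'Poet'
          AND CAST(substr(b.value, 1, 4) AS INTEGER) BETWEEN ? AND ?
        ORDER BY b.article
    """
    return [row[0] for row in conn.execute(sql, (first_year, last_year))]


conn = sqlite3.connect(":memory:")
load(conn, "Friedrich Schiller", parse_record(RECORD))
print(poets_born_between(conn, 1750, 1800))  # ['Friedrich Schiller']
```

Storing one row per value keeps multi-valued fields such as Profession queryable without a fixed column per field; a real deployment might well choose a different layout.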

Field names should be the same in every language. Values should also be language-independent as far as possible, e.g. by using ISO standard codes, so that a query for "country=Deutschland" or "country=Germany" returns the same results on whichever wiki you are using.
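
One way to picture this is a thin normalization layer that maps localized names to the stored code before the query runs. The sketch below is only an assumption about how that might look: the `COUNTRY_CODES` table and both functions are hypothetical, and a real system would need a far larger, maintained mapping.

```python
# Hypothetical lookup table: localized country names -> ISO 3166-1 alpha-2 codes.
COUNTRY_CODES = {
    "germany": "DE",
    "deutschland": "DE",
    "allemagne": "DE",
    "italy": "IT",
    "italia": "IT",
}


def normalize_country(value):
    """Return the ISO code for a localized name, or the code itself if given."""
    key = value.strip()
    if len(key) == 2 and key.upper() in COUNTRY_CODES.values():
        return key.upper()
    try:
        return COUNTRY_CODES[key.casefold()]
    except KeyError:
        raise ValueError(f"unknown country: {value!r}") from None


def matches(record, field, query_value):
    """Compare a stored code with a query written in any supported language."""
    return record.get(field) == normalize_country(query_value)


schiller = {"Name": "Johann Schiller", "Nationality": "DE"}
print(matches(schiller, "Nationality", "Deutschland"))  # True
print(matches(schiller, "Nationality", "Germany"))      # True
```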

Like images, the data fields should be stored in a language-independent namespace, because they should, as far as possible, transcend language.

Motivation

To make it easier to extract similar data from multiple Wikipedia entries, and to search for strings in specific contexts.

Try to answer these questions: "Tell me the 10 highest mountains in Italy", "Who was the president of the USA in 1950?", "Tell me the highest city in Argentina with a population greater than 1000 inhabitants", "Tell me the first African-American woman coming from Europe who was Olympic champion in the 100 m", ... With semantics we can do this. For example, look at wikipedia:Wolfram Alpha: 10 highest mountains in Italy, USA president in 1950. You can also answer these questions with Wikipedia, because the information is there, but it is very difficult! Today we have a lot of lists that can help, for example wikipedia:List_of_Presidents_of_the_United_States, but they are not the right tool. With semantics inside Wikipedia's articles, such questions become easy to answer.

Tools like Wolfram Alpha have a very large database in which the data are linked by semantic relations. Wikipedia has a lot of information, but no semantics.

Semantics could also be tied to the creation of a shared external database.

Key Questions

  • Do we want semantics? Or do we simply want a free encyclopedia?
  • How should semantics be introduced? Inside the article? For example, something like:
 
  '''Mr X''' was born in [[semantic:born_year:1950]] in [[semantic:born_location:California]] 
  and he's the husband of [[semantic:husband_of:Mrs Y]].
  

OK, this is quite complicated for a newbie, but templates could also be used to add semantics.
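
To give an idea of what software could do with inline annotations of this kind, here is a minimal sketch that extracts (subject, property, value) facts from the example text and also renders the plain text a reader would see. The `[[semantic:property:value]]` syntax is the illustrative one used in this proposal, not the syntax of any existing extension, and the function names are hypothetical.

```python
import re

# Matches the illustrative [[semantic:property:value]] syntax from this proposal.
ANNOTATION = re.compile(r"\[\[semantic:([^:\]]+):([^\]]+)\]\]")


def extract_facts(subject, wikitext):
    """Return (subject, property, value) triples found in the text."""
    return [(subject, m.group(1), m.group(2)) for m in ANNOTATION.finditer(wikitext)]


def render(wikitext):
    """Replace each annotation with its value so readers see plain text."""
    return ANNOTATION.sub(lambda m: m.group(2), wikitext)


TEXT = (
    "'''Mr X''' was born in [[semantic:born_year:1950]] in "
    "[[semantic:born_location:California]] and he's the husband of "
    "[[semantic:husband_of:Mrs Y]]."
)

for fact in extract_facts("Mr X", TEXT):
    print(fact)
print(render(TEXT))
```

A template-based approach would produce the same triples; only the editing surface shown to the contributor would differ.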

  • Just how far do you take this? Who would volunteer for the very dull task of tidying the existing data? Perhaps it could be partially automated by programs created by volunteers.
  • Do we want an external database? This means that we could write something like:
 
  '''Proton''' is a [[semantic:be:particle]] with mass [[semantic:particle_mass]]
  

The software then goes to the database, finds the table called "particle", looks up the entry called "proton" in it, and returns its "particle_mass". The advantages are:

    • a database shared across all the projects
    • data that is kept up to date
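
The lookup described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `SHARED_DB` dictionary stands in for the shared external database, the tag patterns follow the example syntax in this proposal, and the proton mass is an approximate value included only so the example produces output.

```python
import re

# Hypothetical shared store: table name -> entry name -> field -> value.
# The proton mass is an approximate, illustrative figure.
SHARED_DB = {
    "particle": {
        "proton": {"particle_mass": "1.6726e-27 kg"},
    },
}

TYPE_TAG = re.compile(r"\[\[semantic:be:([^\]:]+)\]\]")   # names the table
FIELD_TAG = re.compile(r"\[\[semantic:([^\]:]+)\]\]")     # names a field


def resolve(entry, wikitext, db=SHARED_DB):
    """Expand the type tag and field tags of one article from the shared store."""
    type_match = TYPE_TAG.search(wikitext)
    if type_match is None:
        raise ValueError("no [[semantic:be:...]] tag found")
    table = type_match.group(1)
    row = db[table][entry.lower()]
    text = TYPE_TAG.sub(table, wikitext)
    return FIELD_TAG.sub(lambda m: row[m.group(1)], text)


SOURCE = "'''Proton''' is a [[semantic:be:particle]] with mass [[semantic:particle_mass]]"
print(resolve("Proton", SOURCE))
```

Because the value lives in one shared place, correcting it there would update every article, in every project, that references it.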

Potential Costs

People might be less willing to provide information if they are forced to format it in a structured form. Semantics is not very easy for newbies, but I think that we cannot continue without it. Semantics will be the key concept of the new web, and we need to evolve. Today there are tools, like wikipedia:Google_Squared, that scan the web (and mainly Wikipedia) to extract semantics. Look here: highest mountain in Italy. Google does not hold the information itself, but it can get it from the web (and from Wikipedia). To do so, Google's software has to understand the information and extract the semantics from it. We can do better: we can put the semantics inside the Wikipedia articles.

References

  1. See Tim Berners-Lee on Linked Data

Community Discussion

Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Structured Data.

Want to work on this proposal?

  1. Kozuch 11:32, 24 November 2010 (UTC)