Proposal talk:Structured Data

See also mw:extension:semantic MediaWiki. Hillgentleman 01:03, 20 August 2009 (UTC)

Sounds Like an Infobox

I had a look at the Semantic Wiki test site and I think it is probably not yet ready for an encyclopedia. Data can only be encoded as relationships or as properties, e.g. <Berlin> <is capital of> <Germany> or <Berlin> <has population of> <1000000>.

Semantic relationships

In Wikipedia terms, <Berlin> <is capital of> <Germany> would be <pagename> <has relation to> <pagename>.

But Berlin has been the capital of various different Germanies over the years (Hitler's Germany, East Germany, and the united Germany since the wall came down). These semantic relationships don't work unless each of these Germanies has a separate pagename. That might work for Germany, but do we really want the tech dictating how we split and combine pages?

Also, each of these <is capital of> relationships needs to be annotated with a start date and an end date. If this data is to be made machine parseable so it can be used all over the net, then the source for the data should also be machine parseable, or else all kinds of misinformation will spread with no easy way to check it. There doesn't seem to be an easy way to add this info while keeping the basic relationship just as easy to retrieve.
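To make the requirement concrete, here is a minimal sketch of a relationship that carries a start date, an end date and a source, while the basic triple stays just as easy to retrieve. The class and function names are hypothetical and are not part of Semantic MediaWiki. The dates are illustrative and the source strings are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Relation:
    """A <subject> <predicate> <object> statement plus optional metadata."""
    subject: str
    predicate: str
    obj: str
    start: Optional[date] = None   # None means "no known start"
    end: Optional[date] = None     # None means "still valid"
    source: Optional[str] = None   # machine-readable reference (placeholder here)

    def triple(self) -> Tuple[str, str, str]:
        # The basic relationship, with the metadata stripped away.
        return (self.subject, self.predicate, self.obj)

    def holds_on(self, day: date) -> bool:
        after_start = self.start is None or self.start <= day
        before_end = self.end is None or day <= self.end
        return after_start and before_end


def objects_on(relations: List[Relation], subject: str,
               predicate: str, day: date) -> List[str]:
    """Return every object for which the relation held on the given day."""
    return [r.obj for r in relations
            if r.subject == subject and r.predicate == predicate
            and r.holds_on(day)]


# Illustrative data only; the sources are placeholders, not real citations.
RELATIONS = [
    Relation("Berlin", "is capital of", "East Germany",
             date(1949, 10, 7), date(1990, 10, 2), source="PLACEHOLDER-1"),
    Relation("Berlin", "is capital of", "Germany",
             date(1990, 10, 3), None, source="PLACEHOLDER-2"),
]
```

A query such as `objects_on(RELATIONS, "Berlin", "is capital of", date(1970, 1, 1))` picks the statement valid on that day, and `triple()` still gives the plain relationship for tools that ignore the annotations.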

Semantic properties

For a wiki, <Berlin> <has population of> <1000000> will be rendered as <pagename> <has property> <number>, with a separate page for each property where the number type is defined (date, whole number, decimal etc.). Again this needs additional metadata, in this case the date and the source of the info, which doesn't seem to be easy to add with the current Semantic MediaWiki software. Berlin's borders have changed over the years as well, so you might want to define which version of Berlin the figures refer to.
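A rough sketch of what a typed property with the extra metadata could look like follows. Everything here is an assumption for illustration: `PROPERTY_TYPES`, `PropertyValue` and `make_value` are made-up names, and the population figure is the example number used above, not real data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the "one page per property" type definitions.
PROPERTY_TYPES = {
    "has population of": int,   # whole number
}


@dataclass(frozen=True)
class PropertyValue:
    page: str
    prop: str
    value: object
    as_of: Optional[str] = None    # date the figure refers to
    source: Optional[str] = None   # where the figure comes from
    variant: Optional[str] = None  # which version of the subject (e.g. which borders)


def make_value(page: str, prop: str, raw: str, **metadata) -> PropertyValue:
    """Convert the raw text to the property's declared type, or raise ValueError."""
    expected_type = PROPERTY_TYPES[prop]
    return PropertyValue(page, prop, expected_type(raw), **metadata)


# Example value from the discussion; the metadata strings are placeholders.
berlin_population = make_value(
    "Berlin", "has population of", "1000000",
    as_of="PLACEHOLDER-DATE", source="PLACEHOLDER-SOURCE",
)
```

The point of the sketch is only that the date, source and variant sit next to the value instead of being lost in free text.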

Possible way round these problems

Most of the info we are talking about is in infoboxes. See Proposal:A_central_repository_of_all_language_independent_data for a way I think these could be expanded to make them more machine searchable. If the infoboxes are moved to a separate datawiki then we can have multiple boxes, one for each version of an item, and then transclude these back into the relevant pages in each language wiki or into anywhere else folks want to use them. Since the info in the infobox is already more or less structured, it should be possible to use bots for some of the reformatting (I think). Infoboxes can be extended to add references to sources, start and finish dates etc. as required, under the control of the various project teams, before the info is moved, and the change can happen as fast or as slow as the editors want it to.
Does this make any sense? Filceolaire 12:57, 28 August 2009 (UTC)

Discussion

Yes, I 100% agree with this proposal. This is where WP needs to be going in order to remain a leader in assembling knowledge online. Unstructured data and free text are good (especially since they're easy for newbies), but structuring data to make it "computable" is clearly the next step.

How do we assemble like-minded folks to work together on this? I think the more we can envision practical applications, the more compelling it will be. Perhaps related, I created w:Template:SWL as an attempt to start encoding semantic links, even before we have a technical solution to query those semantic links. Thoughts? Cheers, AndrewGNF 00:51, 27 August 2009 (UTC)

Database

Well, here we have again the question whether Wikipedia aims to be an encyclopedia or a database. The 'database-movement' is quite strong. As a database is only possible by converting content to a pre-agreed data format, a database is necessarily always more limited than an encyclopedia, and will always present a distorted view. The database and the encyclopedia are not compatible, and there is a very real danger that the 'database-movement' will eat up much of the encyclopedia. This is one of the major threats to the very core of Wikipedia, but not the only one. - Brya 08:06, 30 August 2009 (UTC)

I think the beauty of structured data proposals (including but not limited to Semantic MediaWiki) is that they would allow Wikipedia to be both an encyclopedia and a database. We would have (as you point out) the breadth of an encyclopedia with the "computability" of a database. I think this hasn't even been an option in the past, but I think it's definitely achievable in the context of five-year strategic planning. My two cents... Cheers, AndrewGNF 15:41, 4 September 2009 (UTC)

Response/New Post (09/25/2009):

Wikimedia is, first and foremost, a collaborative, participative information source, of which Wikipedia is only a part. I think the 'Database' argument is in line with the goals of Wikimedia. However, in order to be effective, a Wikimedia database must have two features: first, it must protect the functionality and integrity of Wikipedia as an encyclopedic knowledge source; and second, it must be highly user friendly, as any database both takes a lot of work to produce and must be relevant to the users. These reasons suggest a Wikimedia database would have to be produced primarily by users, with only 'organizing' functions executed with a 'top-down' approach.

To the point: Wikimedia is in a unique position to create database software which will allow individuals to create highly relevant 'information maps' about the world as they understand it. An information map with categories of Individual, Organization, Environment, Task, and Tools/Methods would provide a meaningful context for any piece or set of information. Mapping connections would be useful in ANY field and even in ordinary life. Lawyers could use such information maps in developing and keeping track of cases; investigators could map crime scenes and relate events, people, and laws in a holistic picture, allowing them to quickly locate gaps in understanding and walk through connections that had been missed. Applications are endless: career counseling needs a map of economic activity to 'show' students what the economy 'looks' like and how all the occupations fit together. How much good do a couple of job descriptions do without an understanding of how the job fits together with other jobs and of what impact the job has? Public policy makers, lawmakers, interest groups, and the political public would all benefit GREATLY from a tool that would help us all understand the big picture in life; the benefits to clarity of direction among humanity would ALONE dramatically transform our ways of living; mapping our society would give us a basis for self understanding and awareness. Further, business, which is about making decisions and creating commerce, needs mapping to understand possibilities, processes, and the ramifications of potential decisions. Emergency/catastrophe response and military operations need to understand key relationships at all levels: if such and such happens, what will be the effect; where are the highest value missions; what are the most central and relevant of tasks?

Maps can start out simple: Facebook-like applications that show how my friends know each other and what the 'groups' in school are, or modal information maps like 'everything relevant to a new student', where things like movie theaters (classified in tools, environment, and in 'organization' as entertainment), bus passes, the library, the dorm, and the engineering building, and tasks like 'going to class,' 'doing homework,' and 'socializing,' would all connect together and perhaps to a student's schedule.

I would STRESS again that the most important thing of all would be ease of use: the average person should be able to put together these maps graphically and with data and then SHARE and INTERLINK with other information maps manually, with assistance, and with full automation wherever possible.

Impact?

Some proposals will have massive impact on end-users, including non-editors. Some will have minimal impact. What will be the impact of this proposal on our end-users? -- Philippe 00:17, 3 September 2009 (UTC)

I think impact on end users due to complexity of the semantic wikilinking code is the primary downside. For example, instead of seeing
Berlin is a city in [[Germany]].
users would see
Berlin is a city in [[located in::Germany]].
(or something similar depending on the exact technical solution). Yes, it's a bit more difficult to read, especially for newbie editors. I think this could in part be solved by a better editor, and I also think it's not so difficult a syntax to learn given all of the other wikicode people have to tolerate (tables, templates, etc.).
On the flip side, there is *huge* upside to this proposal. It would immediately be applicable to removing the "List of..." pages, replacing them with real-time queries. It would allow bots to parse Wikipedia data for knowledge without resorting to natural language processing. And ultimately, this is a big step toward the vision of the "Web 3.0" semantic web. My two cents... Cheers, AndrewGNF 15:51, 4 September 2009 (UTC)
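As a rough illustration of the bot-parsing point, the sketch below pulls `[[property::value]]` annotations out of wikitext with a regular expression, answers a simple "list of" style query, and renders the text back to an ordinary link. It assumes the annotation form shown in the example above and ignores pipes, nesting and other edge cases. The function names and the second sample page are made up for the example.

```python
import re
from typing import List, Tuple

# Matches [[property::value]] but not plain links such as [[Germany]].
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")

Triple = Tuple[str, str, str]


def extract_triples(page_name: str, wikitext: str) -> List[Triple]:
    """Return (page, property, value) for every annotation in the text."""
    return [(page_name, m.group(1).strip(), m.group(2).strip())
            for m in ANNOTATION.finditer(wikitext)]


def pages_with(triples: List[Triple], prop: str, value: str) -> List[str]:
    """A stand-in for a real-time query replacing a hand-kept list page."""
    return sorted({page for page, p, v in triples if p == prop and v == value})


def render_plain(wikitext: str) -> str:
    """Show the annotation as an ordinary link, as a friendlier editor might."""
    return ANNOTATION.sub(r"[[\2]]", wikitext)


# Sample pages; the second one is invented for the example.
PAGES = {
    "Berlin": "Berlin is a city in [[located in::Germany]].",
    "Munich": "Munich is a city in [[located in::Germany]].",
}

ALL_TRIPLES = [t for name, text in PAGES.items()
               for t in extract_triples(name, text)]
```

With this, `pages_with(ALL_TRIPLES, "located in", "Germany")` plays the role of a generated "List of cities in Germany", and no natural language processing is involved.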

Separate into two proposals ?

In principle, one has to distinguish between two things:

  1. The vision of Wikipedia (and other projects) being fully semantic. This would allow automatic translation and automatic generation of data maps as mentioned in the section "Database". But this is a very large project which will need much time and discussion and will also lead to more complicated editing. This is the original proposal.
  2. One could start with making templates like en:Template:Infobox_Space_mission semantic. This would allow us to:

- Share data between different language versions of Wikipedia (see also Proposal:A_central_repository_of_all_language_independent_data).
- Replace List pages like "List of longest space flights" with automatically generated content.
- Use this data in normal text to avoid copy-and-paste / update errors. For example: instead of writing "STS-1 launched on April 12, 1981" one could write something like "STS-1 launched on [[STS-1]].[[en:Template:Infobox_Space_mission]].LaunchDate{longDateFormat}". This is similar to what user Filceolaire says in the section "Sounds Like an Infobox".
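The last bullet could work roughly like the sketch below: the article text asks for a field of an infobox plus a format name, and the value is filled in from one shared store. This is only an illustration under assumptions. The `INFOBOX_DATA` store, the `resolve` function and the formatter table are hypothetical, and the only data used is the STS-1 launch date quoted above.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Hypothetical shared store: (page, infobox template) -> field values.
INFOBOX_DATA = {
    ("STS-1", "Infobox_Space_mission"): {"LaunchDate": date(1981, 4, 12)},
}

# Hypothetical named formats, standing in for {longDateFormat}.
FORMATTERS = {
    "longDateFormat": lambda d: f"{MONTHS[d.month - 1]} {d.day}, {d.year}",
    "isoFormat": lambda d: d.isoformat(),
}


def resolve(page: str, template: str, field: str, fmt: str) -> str:
    """Look up one infobox field and render it in the requested format."""
    value = INFOBOX_DATA[(page, template)][field]
    return FORMATTERS[fmt](value)


sentence = "STS-1 launched on " + resolve(
    "STS-1", "Infobox_Space_mission", "LaunchDate", "longDateFormat")
```

Because every language version would read the same stored date, a correction made once in the store shows up everywhere the field is used.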

I do not know whether it is a good idea to make the second point a separate proposal as this is more short-term and the first thing is more long-term.

In Proposal:Data.wikimedia.org, section Motivation/Existing projects, it is stated that there is already a template which gets text from an (external) database.

Sorry for the bad formatting, but the "Editing Help" link leads to an empty page (Btw: where can I report this?)


--Torsch 10:35, 26 September 2009 (UTC)
