The status of this proposal is:
Request for Discussion / Sign-Ups
Every proposal should be tied to one of the strategic priorities below.
- Achieve continued growth in readership
- Focus on quality content
- Increase Participation
- Stabilize and improve the infrastructure
- Encourage Innovation
- See also: Proposal:Data.wikimedia.org
There is a database called Cyc that contains "common knowledge" in a computer-readable form: things like "Fire is hot", "Water is wet", "Hitler was a man", and simple rules like "If something is wet then it has water on its surface". A small proportion of that database is in the public domain as "OpenCyc", but the vast majority of the system is locked away under proprietary licenses.
I propose the creation of a WikiKnowledge database that would contain these kinds of facts - along with the logical reasoning and inference engines necessary to exploit that database. In time, it would be able to answer complex questions by searching that database and connecting facts and rules together.
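As a rough illustration of "connecting facts and rules together", here is a minimal forward-chaining sketch. The triple format, the `?x` variable syntax and all function names are assumptions made up for this example; they are not Cyc's representation or any existing engine's API.

```python
# Facts are (subject, predicate, object) triples; this format is illustrative only.
facts = {
    ("fire", "is", "hot"),
    ("water", "is", "wet"),
    ("towel", "is", "wet"),
}

# Each rule is (premise pattern, conclusion pattern); "?x" marks a variable.
# This encodes "If something is wet then it has water on its surface".
rules = [
    (("?x", "is", "wet"), ("?x", "has_on_surface", "water")),
]

def match(pattern, fact):
    """Return variable bindings if the fact fits the pattern, else None."""
    bindings = {}
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # A variable must bind to the same value everywhere it appears.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings

def substitute(pattern, bindings):
    """Replace variables in the pattern with their bound values."""
    return tuple(bindings.get(term, term) for term in pattern)

def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new facts are produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for fact in list(known):
                bindings = match(premise, fact)
                if bindings is not None:
                    new_fact = substitute(conclusion, bindings)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known

derived = forward_chain(facts, rules)
```

After running this, `derived` also contains `("towel", "has_on_surface", "water")`, a fact nobody entered directly. A real engine would need multi-premise rules and indexing to cope with hundreds of thousands of facts.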
This is obviously not a new idea - Cyc beat us to it - but what we have in our favor is a vast number of human volunteers to grow that database at Wikipedia-like growth rates - and people with encyclopedia-writing skills who are good at knowledge engineering (even though they may not know it!).
The plan would be to start with the publicly available OpenCyc data, to invent a means for anyone to add facts and rules while using Wikipedia-like methods to verify them, possibly to add references (or at least tags linking them to Wikipedia articles or Wiktionary definitions), and to create an AI system to reason using those facts and rules.
A system would run in the background looking for illogical rules (if someone says "All dogs are cats", we'd hope that the engine could figure out that this was a lie by comparing it to other rules that have previously been entered). Some could be eliminated automatically; others would require the intervention of humans to sort them out ("Hitler was a human" and "Hitler was inhuman" are not necessarily contradictions, although the inference engine might think so).
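One way such a background check might work, sketched under assumptions: suppose the knowledge base already holds "is-a" links and explicit "these two kinds never overlap" declarations. A newly submitted rule "All A are B" is flagged when it would put A inside a kind that A is declared disjoint from. The data layout and the names `is_a`, `disjoint`, `ancestors` and `conflicts` are hypothetical.

```python
# Previously entered knowledge (illustrative data, not taken from OpenCyc).
is_a = {"dog": {"mammal"}, "cat": {"mammal"}, "mammal": {"animal"}}
disjoint = {frozenset({"dog", "cat"})}  # "nothing is both a dog and a cat"

def ancestors(kind, is_a):
    """Return the kind itself plus everything reachable through is-a links."""
    seen, stack = set(), [kind]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return seen

def conflicts(sub, sup, is_a, disjoint):
    """List the disjointness declarations that 'All sub are sup' would violate."""
    sub_kinds = ancestors(sub, is_a)
    sup_kinds = ancestors(sup, is_a)
    return [
        pair for pair in disjoint
        if any(a in sub_kinds and b in sup_kinds
               for a in pair for b in pair if a != b)
    ]

flagged = conflicts("dog", "cat", is_a, disjoint)     # violates the dog/cat declaration
accepted = conflicts("dog", "animal", is_a, disjoint) # no violation found
```

A check like this only catches conflicts with declarations that are already in the database. Cases such as "human" versus "inhuman", where the words clash but the meanings do not, would still have to go to a queue for human review.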
Despite all of the information out there on the Internet (including all of Wikipedia), there is still no good, easy way to ask questions and have "the internet" answer them. It always boils down to searching using some decent search terms and then reading some number of web pages in the hope of distilling answers.
Not true, try Powerset and Wolfram Alpha --Fasten 13:26, 30 October 2009 (UTC)
- How hard is it to write the necessary software?
- Is there something OpenSourced that we could use or adapt?
- Can the "language" in which facts and rules are input be made sufficiently comprehensible for the uninitiated Wikipedian to be able to use them? Raw 'Cyc' language is a little intimidating.
- Could these facts actually be annotated onto the end of Wikipedia articles - kinda like we do cross-links to Wiktionary?
- Our Wikipedia category system is a kind of minimal fact system - we know that "Mini is-a-kind-of Category:Car". What sorts of facts could be extracted automatically in that manner? Could we enroll the help of WikiProjects to decipher their standardized "Info-boxes" into auto-generated facts?
- There is already some kind of agreement or relationship between Wikipedia and Cyc ("Cyclopedia" or "DBpedia"??) - I'm not sure (in detail) what that entails.
- Would there be a large enough ratio of "enthusiasts" to "idiots" to keep the volume of misinformation at bay?
- How would we keep proponents of (say) faith-healing from insinuating "facts" into our science/medical knowledge base that are at odds with current scientific thinking?
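On the category question above ("Mini is-a-kind-of Category:Car"), the extraction step itself could be very small. The sketch below assumes the article-to-category mapping has already been obtained somehow; the input dictionary and the function name `category_facts` are hypothetical, and no real MediaWiki API call is shown.

```python
def category_facts(article_categories):
    """Turn {article title: [category names]} into is-a-kind-of triples."""
    facts = []
    for title, categories in sorted(article_categories.items()):
        for category in categories:
            facts.append((title, "is-a-kind-of", category))
    return facts

# Illustrative input only; real category data would come from the wiki itself.
sample = {"Mini": ["Category:Car"]}
extracted = category_facts(sample)
```

The harder part is deciding which categories express a genuine "kind of" relationship, which is where the human review discussed above would come in.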
The total size of the full Cyc database (including the parts that are not in the public domain)
The current "OpenCyc" knowledge base contains 47,000 concepts and 306,000 facts. This certainly would easily fit on one hard drive. The volume of data is fairly small. However, the computational effort in continually searching for inconsistencies and outright vandalism would likely be significant - also the computational cost of answering user questions might become large if the system became sufficiently popular.
One would therefore expect to need relatively little server storage space - but a fairly large quantity of computational resources.
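To put a rough number on "relatively little", here is a back-of-the-envelope estimate. The concept and fact counts are the OpenCyc figures quoted above; the per-record byte sizes are pure guesses chosen for illustration, not measurements.

```python
# Counts quoted above for OpenCyc.
concepts = 47_000
facts = 306_000

# ASSUMED average storage per record; these sizes are guesses, not measured values.
bytes_per_concept = 100
bytes_per_fact = 200

total_bytes = concepts * bytes_per_concept + facts * bytes_per_fact
total_megabytes = total_bytes / 1_000_000
```

Under those assumed sizes the total comes to roughly 66 MB. Even if the guesses are off by a factor of ten or more, storage is unlikely to be the limiting resource.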
Want to work on this proposal?