Proposal talk:Distributed Wikipedia

discussion / comments from Proposal:Distributed_backup_of_Wikimedia_content

the creation of peer-to-peer distributed infrastructure, where the current wikipedia servers are just another node participating in the network, and the placement of GUI-front-ends on top of the nodes, encourages wikimedia pages and associated images to _implicitly_ be "automatically backed up", by virtue of them being "near-permanently cached".

importantly, the combination of both a distributed "back-end" and nearby (local or loopback 127.0.0.1) front-ends *automatically* makes the choice of which images (and pages) to be cached a very immediate, obvious and relevant one: absolute priority should be given to the pages being requested by the users of the front-ends.
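A minimal sketch of that caching priority, assuming a least-recently-requested eviction rule (the proposal does not name a specific policy). `PageCache` and its methods are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class PageCache:
    """Keeps the most recently requested pages; evicts the least recently requested."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()  # title -> page text, stalest request first

    def get(self, title):
        """Return the cached page, or None; a hit counts as a fresh request."""
        if title not in self._pages:
            return None
        self._pages.move_to_end(title)
        return self._pages[title]

    def put(self, title, text):
        """Store a page fetched from the network, evicting the stalest if full."""
        self._pages[title] = text
        self._pages.move_to_end(title)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)

    def titles(self):
        """Cached titles, stalest first."""
        return list(self._pages)
```

Pages that front-end users keep requesting stay in the cache, which is what makes them "near-permanently cached" on that node.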

a further optimisation, for the convenience of the users, could be an enhancement of the GUI, whereby a query is first submitted ("Paris" for example) and the user then clicks a button "please slurp every page on the top 50 hits into the local cache". over an exceedingly tediously slow link, the back-end is then instructed to give some priority to obtaining these pages (and images). over a *disconnected* link, the back-end creates a list of queries which will go onto a USB memory stick (or other media), such that at some point in time (once a week, or once a month), the USB stick will be shipped abroad (or to a nearby university with internet connectivity), the USB memory stick inserted into a node, the queries run, and the responses placed BACK onto the USB memory stick, and shipped back to the outlying remote area. (no, this is not a joke: despite the recent slashdot coverage of the south african ISP versus Carrier Pigeon, the Carrier Pigeon with a USB memory stick, backed up by a p2p protocol, is actually a damn good idea!).
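The USB round trip could be sketched roughly as follows. The JSON batch layout, the function names and the `fetch_page` callback are all assumptions made for illustration; the proposal does not define a file format.

```python
import json


def export_queries(queries, path):
    """On the disconnected node: write the pending queries to the stick."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"queries": list(queries), "responses": {}}, f)


def fill_responses(path, fetch_page):
    """On the connected node: run each query and store the result on the stick."""
    with open(path, encoding="utf-8") as f:
        batch = json.load(f)
    for title in batch["queries"]:
        batch["responses"][title] = fetch_page(title)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(batch, f)


def import_responses(path, cache):
    """Back on the disconnected node: load the responses into the local cache."""
    with open(path, encoding="utf-8") as f:
        batch = json.load(f)
    cache.update(batch["responses"])
    return len(batch["responses"])
```

Here `cache` is any dict-like store, and `fetch_page` stands in for whatever the connected node uses to obtain a page from the network.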

discussion / comments from Proposal:Distribute_Infrastructure

"Distributed computing generally works by trading off efficiency and speed for lower hardware costs. Jobs have to be queued up, everything has to be done multiple times to make sure the results are consistent, and there may be significant delays between download -> work starting and between work ending -> upload. For most of the processing-intensive aspects of MediaWiki (parsing articles, generating special pages), speed is critical. We can't have people going to a special page and being told "Your request has been sent to the grid, come back in 30 minutes for the results." Mr.Z-man 16:35, 17 September 2009 (UTC)"

Lkcl 15:26, 30 September 2009 (UTC) - every single one of these sentences is completely off-base, thus making it _incredibly_ useful, as it highlights exactly how _not_ to go about creating a distributed wikipedia. so, thank you! answering these one at a time:

"Distributed computing generally works by trading off efficiency and speed for lower hardware costs".

under the distributed wikipedia proposal, users would install a peer-to-peer service which acted as a local cache and as a "file sharing" i.e. "wikimedia sharing" server. this server would also then have an HTTP service which provided EXACTLY (ok, mostly exactly) the same web site interface as is currently presented by the wikipedia.org servers (with a big red mark at the top saying "this is a local server, dummy!"). thus, any individual sees an *increase* in speed, and incidentally provides additional resources to the distributed network. the runaway success of this concept can be seen at both google and (better example) skype, both of which make use of distributed computing in very different ways.

"Jobs have to be queued up, everything has to be done multiple times to make sure the results are consistent,"

under the distributed wikipedia proposal, that would be a background job / part of the role of the peer-to-peer service which would in no way impact on the role of the service as a "local cache". one VERY important job of the service will be, just like any other distributed file service, to keep and still make available the "local edits / changes". about the only thing that would be a good idea would be to mark wikimedia edits that had not yet been "accepted into wikipedia.org servers" in a subtle colour, to indicate that they are still "local" or, in the terms of this document, "Non-Authoritative".
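As a rough illustration of that marking, a local node might track each edit with a flag and let the front-end colour it accordingly. The `Edit` record, the CSS class names and the helper functions are hypothetical, not part of MediaWiki.

```python
import html
from dataclasses import dataclass


@dataclass
class Edit:
    page: str
    text: str
    authoritative: bool = False  # stays False until wikipedia.org accepts the edit


def render(edit):
    """Return the edit text wrapped in a marker the front-end can colour."""
    css = "authoritative" if edit.authoritative else "non-authoritative"
    return '<span class="{}">{}</span>'.format(css, html.escape(edit.text))


def pending(edits):
    """Local edits the back-end still has to submit upstream."""
    return [e for e in edits if not e.authoritative]
```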

"and there may be significant delays between download -> work starting and between work ending -> upload."

yep. so what? the point of the local cache is precisely to mitigate the effect of such delays, and the point of the peer-to-peer communications infrastructure is to find the most local / fastest nodes from whom to push / pull "wikimedia pages". if one set of users start overloading a distributed node, those users can go "this is ridiculous: let's install another distributed node". right now, there's absolutely ZERO way in which ordinary people can speed up and contribute overall to the running of wikipedia (and no, donations of money don't cut the mustard if it's ploughed into centralised infrastructure, thousands of miles away from where the wikipedia pages are being read)

"For most of the processing-intensive aspects of MediaWiki (parsing articles, generating special pages), speed is critical. "

absolutely. users who cannot tolerate the slow speed should install themselves a faster local distributed node.

"We can't have people going to a special page and being told "Your request has been sent to the grid, come back in 30 minutes for the results."

absolutely. fortunately, this proposal is advocating the EXACT opposite of "come back in 30 minutes", and is advocating the use of local-cacheing and local marking of submissions which have not yet been submitted by the back-end service.

overall, i think it would be a good idea for anyone unsure of what is required to look up how google "wave" works. "edits" are sent out in the form of messages, which are distributed out (unfortunately it looks like they're distributed to centralised google servers, which, if that's the case, would be a terminally bad idea - but my suspicions have to be confirmed by examining the code that gets dumped upon us from on high from the gods that google try very hard to not make themselves out to be).

these messages ("edits") are then acted on, and they say things like "add this paragraph into the text HERE please". oh look - does that sound like a wikimedia page edit? it does, doesn't it. how funny, that. in the case of the message being received by the wikipedia.org servers, it would be treated there-and-then EXACTLY AS IF SOMEONE HAD EDITED THE PAGE THROUGH THE WIKIPEDIA HTTP INTERFACE RIGHT THERE RIGHT THEN. including performing ip address (and/or digital signature) checking and ALL OF THE OTHER CHECKS THAT THE WIKIPEDIA SERVICE DOES RIGHT NOW. in this regard, ABSOLUTELY NOTHING CHANGES (except for the addition of GPG or other digital signature checking as an alternative to ip address checking, to cater for long-term offline contributions such as USB memory stick contributions via postal service. see git's "SMTP delivery" system for potential implementation details).
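To make the "edit as a message" idea concrete, here is a small sketch. The message fields (`op`, `page`, `position`, `text`, `author`) are invented for illustration; neither Wave nor MediaWiki is claimed to use this layout.

```python
def apply_edit(page_paragraphs, message):
    """Apply an "insert paragraph" message to a page held as a list of paragraphs."""
    if message["op"] != "insert_paragraph":
        raise ValueError("unsupported operation: " + message["op"])
    position = message["position"]
    if not 0 <= position <= len(page_paragraphs):
        raise ValueError("position out of range")
    updated = list(page_paragraphs)  # leave the caller's copy untouched
    updated.insert(position, message["text"])
    return updated


# Hypothetical message: "add this paragraph into the text HERE please".
message = {
    "op": "insert_paragraph",
    "page": "Paris",
    "position": 1,
    "text": "Paris is the capital of France.",
    "author": "ExampleUser",  # hypothetical account name
}
```

A receiving server would run its usual checks (ip address and/or signature) before calling something like `apply_edit`.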

so, after the message ("page edit") is received, it will be digitally signed. this will generate ANOTHER message, indicating that the page edit is considered to be "Authoritative" as far as wikipedia.org is concerned (of course, the "local edit" was "Authoritative" as far as the _local user_ was concerned, but that's another story).

one of the recipients of the "authoritative" signed change will, eventually, be the distributed node on which the user made the initial contribution. in their case, the edit is already there. special care will need to be taken to ensure that the edit is correctly merged. interestingly, this exact issue has already been solved, technically, many many times, by technology known as "source revision control systems" such as git, mercurial, bazaar, subversion etc. so it's not like there's any really significant extra work needed to be done.
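The simplest case of that merge, where the originating node already holds the edit, might look like the sketch below. It assumes each edit carries a world-wide unique identifier; the record layout and the `reconcile` function are hypothetical. Conflicting changes to the same text are not handled here and would need the kind of merging that revision control systems provide.

```python
def reconcile(local_edits, authoritative_message):
    """Merge an authoritative notice into the node's own record of edits.

    local_edits maps edit GUID -> {"text": ..., "authoritative": bool}.
    Returns "confirmed" when the edit was already held locally,
    "added" when it is new to this node.
    """
    guid = authoritative_message["guid"]
    if guid in local_edits:
        # The edit is already there: flip the flag instead of applying it twice.
        local_edits[guid]["authoritative"] = True
        return "confirmed"
    local_edits[guid] = {
        "text": authoritative_message["text"],
        "authoritative": True,
    }
    return "added"
```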

overall, then, this entire proposal is a matter of designing the API and then fitting together some components and projects which already exist, using proven techniques which already exist, of which the most well-known ones are google wave, bittorrent, git and other such technologies from which ideas and possibly even free software code can just be nicked at random.

practical implications for databases / world-wide identification of articles and revisions

everything needs to be marked with a GUID. 128-bit might well be sufficient, and would fit with the MSSQL GUID support (which is extremely useful, and free software databases should learn from it and adopt it). failure on the part of free software databases to adopt GUIDs means that the burden of implementing 128-bit GUIDs is placed onto developers (as 16-byte UCHAR or other suitable indexable field specification).

the importance of deploying GUIDs on articles and revisions - everything in fact that is to be distributed - cannot be overstated. the whole point is: they're world-wide unique, making it possible for disparate and remote contribution sources to be mergeable.
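A short sketch of GUID-tagged revisions, using Python's standard `uuid` module to generate random 128-bit identifiers. The record fields and the `new_revision` helper are assumptions for illustration only.

```python
import uuid


def new_revision(article_guid, text):
    """Create a revision record whose identifier needs no central counter."""
    return {
        "revision_guid": uuid.uuid4(),  # random 128-bit identifier
        "article_guid": article_guid,
        "text": text,
    }


article = uuid.uuid4()
rev_a = new_revision(article, "edit made on node A")
rev_b = new_revision(article, "edit made offline on node B")

# 16 raw bytes: fits a fixed-width binary column in a database
# that lacks a native GUID type.
key_bytes = rev_a["revision_guid"].bytes
```

Because neither node had to ask a central server for the next revision number, the two revisions can later be merged without identifier clashes.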

in this regard (contribution merging), yes google wave might have some insights and implementations that will be of benefit - but the implementation will _have_ to be made available in an acceptable free software format.

thoughts

  • once the API is defined which allows inter-database-node peer exchange of pages and contributions, i think you'll find that free software communities _very_ quickly will provide alternative implementations (such as the git idea). the key to that will be doing the drudge work to define the API, and to provide a reference implementation [that, obviously, provides access to and fits with the existing wikipedia infrastructure]

Google Gears / HTML5

a potential mini-implementation, especially if the API is an HTTP-based RPC mechanism, could involve offline storage of wikipedia pages inside web browsers' HTML5 SQL storage and/or google gears. GWT has support for gears; pyjamas has ported the gears SQL support but not the offline storage (yet).

Impact?

Some proposals will have massive impact on end-users, including non-editors. Some will have minimal impact. What will be the impact of this proposal on our end-users? -- Philippe 00:08, 3 September 2009 (UTC)

  • Short term: practically invisible, better responsiveness over time (potential), lower chance-of-failure (potential)
  • Medium term: less reliance on donations. Potential realized.
  • Long now: Improved survivability of the encyclopedia.


Also: Define "end-user"? I would like to see everyone touched by the diverse wiki projects as (potential) prosumers, rather than passive end users.

I don't fancy the "prosumer" idea. Mostly I go to Wikipedia to read, hence act as an information consumer, an end-user. Sometimes, I add or change or delete text. An information producer. One person. Two use cases. So what? (martingugino 01:58, 2 October 2009 (UTC))

XMPP / (+wave)

This is a perennial proposal.

Some things have changed in the mean time, however.

We could actually just hook up with the (then) existing Wave architecture? --Kim Bruning 10:35, 3 September 2009 (UTC)

will Wave actually exist (in time), in implementations outside of google [pseudo-proprietary] servers / services? i.e. will google FORCE people to utilise google servers, just like they have done with google "chat", by releasing the client but not the server implementations, and designing the client specifically around client-server protocols (which get called p2p) rather than FULLY peer-to-peer independent protocols which have absolutely zero dependence on centralised infrastructure? Lkcl 23:08, 30 September 2009 (UTC)

Non-public information?

Much of the database could not be distributed fully to untrusted parties. The revision, logging, ipblocks, recentchanges, archive, user, and watchlist tables all contain potentially sensitive information. Mr.Z-man 16:39, 17 September 2009 (UTC)

I would tend to say that such information is implementation-specific to the servers behind the wikipedia.org domain, as part of creating "authoritative" nodes. If a particular distributed node chooses to store such information, _great_. if other nodes choose not to, that's absolutely fine. when an "offline" / "non-authoritative" (viz "distributed") node "pushes" data up to "authoritative" nodes e.g. wikipedia, it will be UP TO THE RECIPIENT SERVERS to decide whether to a) create authoritative/unique revision numbers b) perform logging c) perform ipblocking d) maintain / merge recentchanges e) perform archiving f) maintain / merge user lists g) provide a watchlist service. Lkcl 11:45, 22 September 2009 (UTC)
Eh? My point was that many of the database tables contain information that we cannot give to people who have not identified to the foundation, as per the privacy policy/access to non-public data policy. It's not about whether people want to accept certain kinds of data, it's that they can't. Mr.Z-man 20:18, 30 September 2009 (UTC)
Lkcl 23:21, 30 September 2009 (UTC) i still see no conflict, here. two reasons.
"No Conflict 1") anyone wishing to set up their own version of wikipedia, which does NOT have such "identification policies", will be at liberty to do so, and will be able to one-way suck/merge "standard" wikipedia into its own WITHOUT overloading the wikipedia.org servers, because their "new" service will simply be a part of the peer-to-peer network, which, by its very nature, will be impossible to "overload". in this way, you will quite likely see an exponential explosion in the use and deployment of wikipedia technology, beyond even the wildest imagination of its founders.
"No Conflict 2") it's why the "digital signatures" are an integral part of the proposal. see "Debian GPG Keyring" and other GPG signing. any server which is "not part of the wikipedia.org data centre" (distributed or otherwise) MUST perform a check on the digital signature of any changes / contributions. failure to do so, by the wikipedia.org servers, would be a violation of the foundation's policy. the digital signature will be REQUIRED to link BACK to an account which was created ON the wikipedia.org servers, thus completing a mathematically unforgeable link, and thus providing the foundation with satisfactory fulfilment of its criteria and policies for "identification to the foundation".
p.s my answers above are based on a specific interpretation of your point, which is, unfortunately, still ambiguous. it's late, so i would ask you to clarify precisely and unambiguously, preferably by providing a specific example with made-up data, that illustrates your point. "certain types of data" is far too non-specific, as is "many of the database tables", as is "people". which people? the users? the contributors? the founders of wikipedia? the database administrators? to whom are you referring? Lkcl 23:26, 30 September 2009 (UTC)
Lkcl 09:07, 1 October 2009 (UTC) answer number 3: i don't believe you're getting it. the public APIs being proposed have nothing to do with the internal decisions made by wikipedia to keep and track "sensitive information". the public APIs being proposed are absolutely everything to do with what can ALREADY BE SEEN through the EXISTING wikipedia.org HTTP interface. anything that is beyond what can ALREADY be accessed via the existing HTTP interface must be considered to be "implementation-specific" to the wikipedia foundation's SPECIFIC implementation of the APIs being proposed.


(moved point about "full functionality" to a separate section)
(moved z-man's clarification answer into separate section, "foundation requires physical identification")

Privacy Concerns

Lyc. cooperi raises the issue of privacy. an (ongoing) discussion is below, and the main points are listed here. please continue the discussion and i will update the bullet-points on an ongoing basis:

  1. a random p2p node somewhere might be logging data for malicious purposes. answers to this include: a) don't use the p2p service b) design the p2p service with anonymising carried out. possibly the easiest and simplest way to do this is: drop the entire DW p2p infrastructure on top of gnunetd. problem is solved.
  2. lyc also points out that many people actually don't give a stuff about security and privacy, and just get on with their lives. whilst this is good, some countries actively censor and imprison their citizens. the answer to this is: utilise the offline capabilities of the DW proposal.
  3. lyc also points out that really, a keyserver service should be utilised / available. this is implicit in much of the discussions (mention of "digital signature verification") but needs to be made more explicit. also, the main wikipedia.org site really needs to consider allowing users to add one of GPG public keys, ssh public keys, openid and more to the authentication methods allowed, either as internal wikipedia services _or_ as external to wikipedia _or_ both.
  4. lyc also would like to see the continuation of anonymity, via the DW protocol. for that, the answer is that a p2p node that utilises gnunetd's anonymising service would be a perfect combination. even to the extent of real-time relaying the CAPTCHA images as a challenge-response to check that there really is a user at the end of this anonymous chain.

Z-Man already had your point - he was saying private information shouldn't even possibly touch unnecessary Non-WMF servers. You don't need to go around accusing people of "not getting it." Anyway, I'm not the best at computers still and I like your proposal, but I'll play devils advocate: everyday folks don't care enough about privacy. And that's exactly the problem because everyday folks are among those who need privacy. And while you can say "everyone should just connect to a privacy-respecting server which uses a version of wikipedia which trashes the log data" there is no way to know For Sure that that server isn't logging its data/packets

Lyc. cooperi - there has been research done into p2p network protocols that provide - and provide proof of - anonymity. there has also been research done where IP addresses can be harvested from e.g. bittorrent and it is these insecurities that the other research addresses. ultimately, if it _really_ matters: don't use the p2p service - use the standard web site! Lkcl 19:54, 2 October 2009 (UTC)

That can be a problem. What might a node which is collecting information on usernames connected to IP address connected to contributions do? Sell it to marketers who can then correlate your IP address to the content and then target you on websites you visit? They might give it to a strongarming government that doesn't like the bad things people have written about them. Some server hosts might be as or more scrupulous than the WMF, but I guarantee you not all of them will be. Heck, governments might run hundreds of their own wikimedia computers just to catch possible IP address-to-user (and thus IP address-to-contributions) connections of users who connect through them. Once you take technology out of WMF's hands you take security out of their hands as well.

Lyc. cooperi, you raise good points, and it's answered by this: any user may install their own back-end p2p service, restrict it so that it only submits direct to wikipedia (on their behalf); does not interact (using the p2p protocol) with ANYTHING other than the wikipedia p2p-enabled servers; they also configure it so that they are the ONLY people who can access their locally-installed back-end p2p service. in this way, they get the benefits of cacheing, and of off-line access _and_ will still be able to edit pages as if they were submitting the edits direct to the wikipedia.org site. ultimately, however, if this is simply all too complicated for them (which would be a failure on the part of the DW project: it should be _absolutely trivial_ to install and configure), they can just go use the existing http wikipedia.org web service. Lkcl 19:54, 2 October 2009 (UTC)
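The locked-down setup described above could be expressed as a small configuration object. This is only a sketch: `NodeConfig`, its fields and the host name `p2p.wikipedia.example` are hypothetical placeholders, since no DW configuration format exists yet.

```python
from dataclasses import dataclass, field


@dataclass
class NodeConfig:
    # Peers this node will speak the p2p protocol with (hypothetical host name).
    allowed_peers: set = field(default_factory=lambda: {"p2p.wikipedia.example"})
    # Clients allowed to use the local front-end: only the owner's own machine.
    allowed_clients: set = field(default_factory=lambda: {"127.0.0.1"})

    def may_exchange_with(self, peer_host):
        """True if the node may push to or pull from this peer."""
        return peer_host in self.allowed_peers

    def may_serve(self, client_ip):
        """True if this client may browse or edit through the local node."""
        return client_ip in self.allowed_clients
```

With these defaults the node caches and works offline for its owner, but never exchanges data with arbitrary third-party peers.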

It is true that this layer of insecurity might be irrelevant: in some countries, ISPs already sniff user packets, or Feds sniff central routers. Still, I trust we are not so cynical that we make it okay for a distributed network of strangers to do the same to all wikipedians across the globe in regards to their wikipedia contributions.

Lyc. cooperi - yyyup :) most people don't give a stuff :) you only have to look at how many people sign up per day for facebook to realise that. the "convenience" and the simple fact that the sheer number of people involved is just ... overwhelming, makes something of a mockery of this whole "big brother" thing. *however*, there are in fact countries where they are _actively_ putting people into jail for accessing wikipedia, and _actively_ censoring. for these circumstances, the "offline" capabilities of the DW proposal is perfect: just smuggle a memory stick in and out of the country, or between associates, where the computers being utilised to edit and view pages are UTTERLY network-isolated. Lkcl 19:54, 2 October 2009 (UTC)


Further, in defense of WMF: WMF could buckle at any time to higher powers, or decide to shut down the site, or the feds could come in and try to crash it. But that's what backups are for. I know at least one person who keeps a backup. Most big drives today can handle a backup of the entirety of en.wikipedia article content and history. Maybe not talk pages and media files, so there's your talking point I guess. A good request might be that WMF periodically has another organization (one also disconnected from it with a self-interest in behaving in a moral fashion (such as EFF)) back up its data in entirety.

Lyc. cooperi - now take that one step further: let _everyone_ in the world be part of the "backup". redefine "anyone who has a locally cached copy of the wikipedia pages that they regularly view on a day-to-day basis" as being "part of the backup". _now_ you have an incredibly empowering system, which is now completely impossible to shut down, and completely outside of the control of "higher powers". you only have to look at how many large corporations have spent _vast_ amounts of time trying to attack and shut down skype to appreciate the power of truly distributed communications. Lkcl 19:54, 2 October 2009 (UTC)

Finally, if you want to decentralize control, use another keyserver than wikimedia. Tha'sall. Prove all this wrong if possible. Cause it sounds like a cool idea (if it doesn't mean an end to anonymity). -Lyc. cooperi 22:15, 1 October 2009 (UTC)

the keyserver idea is a good one, and i hadn't explicitly mentioned it, but perhaps it is a good idea to. i'd implicitly mentioned it by talking about "checking digital signatures". implicitly, that _automatically_ means that you need to go out and download and verify a user's GPG public key from somewhere (or you need the user to have uploaded it to wikipedia.org via an extended / additional feature). other options include the use of openid. etc. but: yes. wikipedia's infrastructure would need to start thinking beyond "a uthername and uh patthwuhhd" and start getting into the 21st century for this whole DW thing to work. Lkcl 19:54, 2 October 2009 (UTC)
the answer to the anonymous question is: that a p2p node that utilises gnunetd's anonymising service would be a perfect combination. via HTTP, there already exists the means to anonymise IP addresses by using "TOR"; the use of a combination of gnunetd and the DW p2p protocol is no different. yes, you would still need to tunnel a CAPTCHA all the way through, in real-time, from wikipedia's servers right through gnunetd and the DW p2p gnunetd-enabled nodes, right through to the user's front-end and back again: this would be dreadfully slow (in order to help maintain anonymity) but that's just the price that any user wishing to remain completely anonymous pays. Lkcl 22:02, 2 October 2009 (UTC)

Foundation requires "Physical Identification"

User:Mr.Z-Man kindly raises the issue that the wikipedia foundation requires "physical identification" to be submitted to the office, before anyone can be allowed access to "the data". the highlights of the discussion below (which is on-going, so please continue it and i will continue to update the bullet-points here) are:

  1. the office requires faxed copies of passport or drivers' license before "access to data" is permitted
  2. the debian GPG keyring (comprising nearly 1000 developers) is MUCH more stringent, and involves mutual GPG key-signing in the PHYSICAL PRESENCE of two people, who must show each other their passports. such a system could be leveraged to overcome (nay, overwhelm!) any objections.
  3. the "opencertificate" process is somewhat similar and could also overcome objections
  4. the very existence of wikipedia "robots" rather makes a mockery of the whole idea of requiring "physical identification" in order to get at "the data".
  5. making the "Distributed Wikipedia" a "robot" i.e. using HTTP GET and POST to the wikipedia.org web site php pages would be a complete pain to develop.
  6. authentication would be troublesome if "DW" was a "robot". unless the HTTP GET and POST wikipedia.org php pages interface were extended to accept digital signatures. however: OpenID _might_ cut the mustard.
  7. there really isn't that much of a difference, really, when you get down to it, between HTTP GET and POST getting at "the data" and a p2p distributed service getting at "the data". handing over 4 terabytes of data on hard drives: yes, i can understand: "physical identification" is kinda a good idea. putting a p2p distributed service, with access to pretty much exactly the same data as the HTTP GET and POST via wikipedia.org into the same frame as "handing over 4tb"? naah.


"You still seem to be missing my point entirely. Wikimedia will not distribute data to users who have not identified to the foundation. This isn't about digital signatures, this is physical identification - a copy of a passport or drivers' license faxed to the office. Mr.Z-man 22:39, 1 October 2009 (UTC)"

ahh, thank you z-man - you didn't mention exactly what "physical identification" means, nor its significance. now that's been mentioned, i now understand. ok. 1) i've mentioned it twice, but i'll mention it again: are you aware of how the debian GPG keyring works? basically, debian developers PHYSICALLY identify each other before signing each others' GPG keys, by showing each other their passports. EVERYBODY does this. the more people that obey the rules, the less chance there is of two people having disobeyed them. then, also: look up "open certificate". in order to become a certificate signer, you have to get i think it's something insane like ONE HUNDRED people to sign your GPG key (that's 100 times that you have to show your passport to 100 different individuals) or you can get three bank managers to sign your GPG key - something like that. then, once you have done this, _then_ you are authorised to sign people's SSL certificates, for a period of one year i think it is. at the end of one year, i can't remember but i think you're either not allowed to be a cert signer ever again, or you have to go through the whole process again. so, if "physical identification" is required, then you can just hook into either of these two systems and you will then have several hundred possibly several thousand people who immediately qualify as having been "physically identified". 2) the data being discussed is ALREADY "distributed". anyone can write themselves an HTTP "crawler" of the current HTTP wikipedia interface. anyone can write a service that does HTTP GET and HTTP POST via the php pages in front of the "data". doing so would give you EXACTLY what is required (bar the GPG digital signatures as being a way to not require user password to be pushed over the service in-the-clear), but would be a complete pain in the ass to develop. there is in fact precedent for doing exactly this: wikipedia robots. Lkcl 19:23, 2 October 2009 (UTC)
Is the problem, then, how to avoid a centralized log-in to wikipedia? I didn't get your point about faxing my passport. What will they let me do if I fax them my passport? (martingugino 02:34, 2 October 2009 (UTC))

Full Functionality Issues (User Preferences) via API?

One could not make a mirror with full functionality using only the data that's available for non-authenticated requests from the current interface. You could get most of it, and it would be sufficient for logged-out users who want to do nothing but browse articles. However, for logged-in users, the appearance of articles (skin, math formulas, image thumbnails, etc.) is often based on user preferences. User preferences are considered non-public data. Users have access to their own preferences via the normal interface, and that's all. There is no GUI/external API to access the preferences of other users, which would be necessary for displaying pages to users based on their own preferences. But users couldn't log in anyway, because password hashes are also considered private data.

ok. this is an excellent point. fortunately, i've studied authentication systems such as microsoft's NT Domains protocol, and to some extent kerberos, so i'm aware that it is possible to do "pass through" authentication. e.g. challenge-response, where the challenge is passed from the authentication server "through" all the intermediaries, until it arrives at the user's machine. the "response" is created, and shipped all the way back. the challenge-response algorithm can even actually be executed in the user's web browser (as javascript). so the challenge would be received via AJAX; the challenge "munged" with the user's password, and the response shipped back using AJAX to whatever web service; then, the response is bounced back through the p2p network all the way back to whatever server issued the challenge. this _entirely_ gets you round the problem of having to expose password hashes. actually: it's just one way.
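A rough sketch of such a pass-through exchange, using an HMAC over the challenge as the "munging" step. This is an assumption chosen for illustration, not the NT Domains or kerberos algorithm, and it presumes the authentication server holds the password (or an equivalent secret) to recompute the response.

```python
import hashlib
import hmac
import secrets


def issue_challenge():
    """Authentication server: produce a fresh random challenge."""
    return secrets.token_bytes(16)


def respond(password, challenge):
    """User's machine: combine the password with the challenge."""
    return hmac.new(password.encode("utf-8"), challenge, hashlib.sha256).digest()


def relay(payload):
    """Intermediate p2p node: forwards bytes unchanged and never sees the password."""
    return payload


def verify(stored_password, challenge, response):
    """Authentication server: recompute the expected response and compare."""
    expected = respond(stored_password, challenge)
    return hmac.compare_digest(expected, response)
```

The intermediaries only ever carry the challenge and the response; a response captured in transit is useless against a different challenge.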
so - let's assume that this problem has been solved, and that there exists SOME method that is acceptable. i suggested "digital signatures" e.g. GPG signing, because that's a good way to do "offline" authentication. and yes, you _can_ use GPG signing as a way to authenticate! basically the server sends a random challenge; the user signs it with their private key, and the server verifies the signature with the user's public key. as the user had to type their password in order to unlock their GPG private key, and you ASSUME that the user takes good care of their private key, the user is authenticated.
so - let's assume that world-wide authentication can be "taken care of". you now have the means by which an individual, regardless of whether they are using the wikipedia.org site "direct" or whether they are using a p2p node, they can be "identified". but, you've raised an interesting point that indicates some confusion as to the division of services: who is _caring_ about this "appearance", or about this "preferences"? let me be more specific: does a user _care_ that their checkbox, entered originally on wikipedia.org's server, saying "i don't want the javascript editor option thank you", is sent in-the-clear via a p2p network? does a user _care_ that their public signature which is added to all posts, as public information, is sent in-the-clear via a p2p network? is there, in fact, _anything_ about which the user could possibly care about, _especially_ given that the information is sent in-the-clear over HTTP anyway! if the answer is "no", great: we don't care either :)
however, if the answer is "yes" to any of these, then there does exist a technical solution: encryption. and yes, it's possible to get implementations of e.g. AES in javascript. all that's needed is to add, as part of the "authentication session", a means by which a secret session key is negotiated. this secret key can then be utilised as the encryption key to e.g. AES. the user's preferences then ultimately end up being decrypted and kept IN THE USER's BROWSER (or other GUI front-end app). then, it is the USER's BROWSER (or other GUI front-end app) that makes the relevant requests for the formulae (according to the decrypted preferences). all you've done is moved the decision-making OUT of the wikipedia php pages and IN to the user's own web browser (or other GUI front-end). and, you've made the paranoid people even more happy, because their preferences are _their_ preferences, and they're now secure and encrypted and thus under their control (well... not really, but they like to think that they are).
but - leaving that aside, there's an even more slightly fundamental point, and it's this: what the _heck_ do you need to store "user preferences" on the wikipedia.org servers for, anyway, in a p2p-based distributed network?? you have a local cache / service, where _you_ are the only person logging in to it; you have a _local_ web browser: why the heck should the wikipedia.org "user preferences" even come into the equation, especially since, via the p2p node, you'd be getting at the data in "raw" format anyway and the p2p node will have to "reconstruct" it (according to user preferences) _anyway_?
the answer to this is: actually, for convenience, you really might actually want a global username, such that when you move to another workstation, you might want the same "user preferences". under these circumstances, the solution is: the user preferences are encrypted by the user password, on the GUI; the user preferences are decrypted by the GUI; the user preferences are _managed_ by the GUI; everything related to user preferences is done by the GUI, the GUI, the GUI. even if it's the web browser. yes, you _can_ get implementations of AES and even Diffie-Hellman in javascript.
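a rough sketch of the "encrypted by the user password" part, assuming the GUI derives the encryption key from the password with PBKDF2 (that choice, the function name derive_prefs_key and the iteration count are assumptions made for illustration, not something the proposal specifies). only the key derivation is shown; the actual AES encryption of the preferences blob is left to a crypto library.

```python
import hashlib
import os

def derive_prefs_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 32-byte key for encrypting preferences (hypothetical helper)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

# The salt is not secret: it can be stored next to the encrypted preferences,
# so any workstation can re-derive the key once the user types the password.
salt = os.urandom(16)

key_home = derive_prefs_key("correct horse", salt)   # first workstation
key_work = derive_prefs_key("correct horse", salt)   # second workstation
key_wrong = derive_prefs_key("wrong guess", salt)    # someone guessing
```

the point of the sketch is that the nodes only ever store the salt and the ciphertext; the key exists only in the GUI, on whichever workstation the user happens to be sitting at.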
so, ultimately, there are answers for all the questions you've raised, and also for the implied questions, satisfying all of the security issues because they're simply side-stepped: no node - including even the existing wikipedia.org servers - EVER need access to the unencrypted user preferences (except implicitly via a process of reverse-engineering the requests that the user makes, and even that is made more challenging but not impossible in a p2p distributed network. you can't have everything...) Lkcl 20:27, 2 October 2009 (UTC)

Distributed Wikipedia made "completely pointless" by Wikipedia Foundation's "Physical Identification" requirements?

one other scenario, raised by the "User Preferences" discussion. imagine this: "normally", a non-logged-in user to wikipedia.org is, as you say, restricted in their access rights (both reading i presume, and submission of edits). in the case of a p2p node, there's absolutely nothing in the world to stop someone from gaining access to the "raw data" by submitting p2p queries. everyone is "equal". it's a TRUE democracy. in the p2p node "world", everyone in the world has equal rights (complete freedom) to access wikimedia, historical data, images etc. thus we run into the question that User:Mr.Z-man raised, about requiring "physical identification" (see separate section) in order to access "raw data". but, we've seen, in that section, answers and questions about it. the thing is: if the wikipedia foundation seeks to enforce this "physical identification" issue onto the p2p interface, then that is deeply unimpressive, and also looks like it terminates the usefulness and the whole point of the DW proposal?

no - it wouldn't.

whilst the usefulness of the DW proposal to get at *existing* wikipedia would be made damn awkward, there is absolutely no reason to not provide the DW service itself, and see how it "gets on". especially in light of the fact that a robot could suck the entire wikipedia database and fork it, or, all it takes is *one* individual who has carried out the "physical identification" to prime the entire p2p system with the *existing* wikipedia data, and go from there.

in this way, if, hypothetically, the wikipedia foundation _were_ hypothetically to enforce the hypothetically silly "physical identification" process (on the basis that p2p distribution of wikimedia data is identical to asking for a 4 terabyte backup), the world could _still_ benefit from the advantages that the DW proposal brings (uncensorable unrestricted access to information, and more), and, i suspect hypothetically that hypothetically the existing wikipedia.org infrastructure (hypothetically along with a foundation unable hypothetically to keep up with its responsibilities and duties to bring the wikipedia concept to a wider world) would become hypothetically completely irrelevant. that is, of course a BIG hypothetical "if".

API

Does this offline idea include desktop MediaWiki editors via the API? Ecosystems such as Twitter have exploded because of participation through an API, and it's quickly becoming standard for Web companies and non-profits of many kinds. Steven Walling 01:32, 23 September 2009 (UTC)

effectively, yes. the whole point is that the edited contributions need not go *direct* into the wikipedia servers, but could go into an alternative server (which also supports the same API) perhaps running on the user's own desktop machine, perhaps running on university or on corporate or sponsored servers. the job of that alternate server (which would be running pretty much the exact same service as the wikipedia servers) is to not only act as a local cache of wikipedia pages and re-present the (merged) contributions via the standard wikipedia HTTP service but also to submit the contributions (or not) to the "authoritative" wikipedia servers, whereupon they would be digitally signed after acceptance / merging. the digital signatures would then be distributed out along with the contributions (via the API). the thing is: anyone and everyone gets the exact same rights to do exactly the same thing as wikipedia: make themselves / set themselves up as "authoritative", and it is then up to *users* to decide whom they regard as "authoritative". for an example to follow, take a close look at how debian package management is digitally signed. now treat "contributions" as "packages", and add a standardised peer-to-peer distributed API for the communication of "packages" aka "contributions", and you have the general idea. Lkcl 15:48, 24 September 2009 (UTC)
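as a very rough sketch of the "signed contributions, and users choose whom to trust" idea: the code below uses textbook RSA with toy keys purely as a stand-in for a real signature scheme (debian itself uses GPG). every name here (make_toy_authority, sign, verify, accepted, "wikipedia.org", "university-mirror") is hypothetical, and the keys are far too small and unpadded to be secure.

```python
import hashlib

E = 65537  # public exponent shared by all toy authorities

def make_toy_authority(p, q):
    """Build a toy RSA key from two primes (illustration only, insecure)."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # modular inverse, Python 3.8+
    return {"n": n, "d": d}

def digest(contribution: bytes, n: int) -> int:
    """Hash a contribution down to an integer below the modulus."""
    return int.from_bytes(hashlib.sha256(contribution).digest(), "big") % n

def sign(contribution: bytes, authority) -> int:
    """Sign with the authority's private exponent."""
    return pow(digest(contribution, authority["n"]), authority["d"], authority["n"])

def verify(contribution: bytes, signature: int, public_n: int) -> bool:
    """Check a signature using only the authority's public modulus."""
    return pow(signature, E, public_n) == digest(contribution, public_n)

def accepted(contribution, signatures, trusted):
    """True if any authority this user trusts has validly signed it."""
    return any(
        name in trusted and verify(contribution, sig, trusted[name])
        for name, sig in signatures.items()
    )

# Two would-be "authoritative" servers, built from known Mersenne primes.
wikipedia = make_toy_authority(2**127 - 1, 2**89 - 1)
mirror = make_toy_authority(2**107 - 1, 2**61 - 1)

edit = b"Paris is the capital of France."
signatures = {"wikipedia.org": sign(edit, wikipedia)}

# Each user keeps their own list of trusted public keys.
trusts_wikipedia = {"wikipedia.org": wikipedia["n"]}
trusts_mirror_only = {"university-mirror": mirror["n"]}
```

here accepted(edit, signatures, trusts_wikipedia) passes while the same call with trusts_mirror_only does not: the contribution and its signatures travel together over the p2p network, and the decision about who counts as "authoritative" is made at the user's end.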

Late commentary

Looks like I'm about two years too late, but I consider this one of the potential "big ideas" that could really change WM for the better. I hope it happens someday. --Alecmconroy 14:00, 11 June 2011 (UTC)
