Design
Framework Choice and Data Distribution Design
How the inverted index should be distributed with the DHT
Original Plan
Indexing
Initially, the plan we developed was to have the crawler send the extracted and structured documents to its local full-text-search component, which would break them up and assemble groups of keywords. Those groups of keywords are then split up across the p2p network, turned into an inverted index, and persisted in the DHT.
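As a minimal sketch of the indexing step (plain JavaScript; the document inputs are hypothetical stand-ins for the crawler's output, and the real keyword extraction in the full-text-search component is more involved):

```javascript
// Build a minimal inverted index: term -> posting list (document ids).
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const { id, text } of docs) {
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(term)) index.set(term, []);
      const postings = index.get(term);
      if (!postings.includes(id)) postings.push(id);
    }
  }
  return index;
}

const docs = [
  { id: "doc1", text: "Grundgesetz Artikel 1" },
  { id: "doc2", text: "Grundgesetz Artikel 2" },
];
const index = buildInvertedIndex(docs);
console.log(index.get("grundgesetz")); // [ 'doc1', 'doc2' ]
```

It is these term-to-posting-list entries that get split into groups and handed to peers for insertion into the DHT.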
Retrieval
The posting lists of the inverted index are available to the full-text-search component through an HTTP server, which answers the UI's search queries.
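On the query side, each term of a multi-term query resolves to one posting list, which the full-text-search component can then combine. A minimal sketch, assuming AND semantics across terms (the actual merging/ranking strategy is not specified here):

```javascript
// Intersect the posting lists fetched for each query term (AND semantics).
function intersectPostings(lists) {
  if (lists.length === 0) return [];
  return lists.reduce((acc, list) => acc.filter((id) => list.includes(id)));
}

// Hypothetical posting lists for a two-term query.
const byTerm = {
  grundgesetz: ["doc1", "doc2"],
  artikel: ["doc2", "doc3"],
};
console.log(intersectPostings(Object.values(byTerm))); // [ 'doc2' ]
```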
Choosing a p2p library
I had decided that libp2p would be the p2p framework to use, as it fulfilled all of the project's demands:
- net and routing (different configurable ways to run/setup the network)
- content routing (put/get using key-value pairs)
- static connection addresses
- well maintained and documented
- efficient transports
- well tested with checks and corrections throughout to maintain consistent state and functionality
It was preferable to:
- using a smaller (and/or outdated) library that didn't cover the demands above
- building the system from the ground up, for which the timeframe would have been tight and which probably would not have reached libp2p's level of consistency
libp2p's Constraints
One constraint that comes with using libp2p is that its put operation separates local and remote insertion. Looking at the implementation of put, it always inserts locally first and then remotely into the network. This does not match our specification: we did not want to distinguish between local and remote insertion, but simply wanted to insert on the peer(s) closest to the hashed key of the value.
Possible changes to the design to address this
1. local & remote
Keep current design and use the given implementation to insert once locally and n-times remotely.
Pro:
- scalable: the processing of keywords into posting lists happens distributed across the network

Contra:
- suboptimal inverted index distribution: the algorithm used in the full-text-search component to divvy up the keywords to different peers for computation would also be responsible for distributing the local part of the data across the network
- no separation of concerns: that same algorithm would handle functionality that should be part of the p2p net
2. remote only
Designate a peer to become the central crawler and inserter instance and only insert remotely.
Pro:
- optimal inverted index distribution: by inserting on the closest peer, per the Kademlia specification, the data ends up optimally distributed across the network
- simple implementation: least effort to implement (no balancing algorithm in full-text-search, trivial adjustment in the framework)

Contra:
- not scalable: having all the computation of creating the index done on a single peer has a natural limit and is not an efficient use of the network's resources
- resource imbalance: increased data load on the rest of the peers, while storage (hd) and cpu resources go unused (see previous point)
- single point of failure: not critical though, as the indexing only happens upfront and very occasionally throughout service
3. local/remote
Fork the framework and adjust it so that local and remote put/get go through the same routines. This would be the optimal solution and satisfies the original design.
Pro:
- optimal inverted index distribution: same as with 2.
- optimal resource usage: storage and cpu load are distributed equally
- scalable: same as with 1.

Contra:
- stability / introduction of inconsistencies: requires heavy modification of the framework, which would possibly circumvent its internal checks and corrections and produce a less stable framework as a result
- a lot of work: by far the most effort; I tried implementing this and hit one hurdle after another. The underlying system (Kademlia and k-buckets below libp2p) definitely allows for this change, but you are fighting against libp2p at every point
Our choice
We decided to move forward with "remote only" (2.), as the downsides are very much tolerable and the pro side gives us the best of both worlds.
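With this choice, the designated indexing peer pushes every posting list into the network through the framework's /append/:key route. A sketch of that loop (the request sender is injected so the sketch stands alone; in practice it would be fetch or a similar HTTP client, and baseUrl would point at a framework node, both assumptions here):

```javascript
// Remote-only insertion (option 2): a single inserter peer walks the
// inverted index and appends each posting via PUT /append/:key.
// `send(url, options)` stands in for an HTTP client such as fetch.
function insertIndex(index, baseUrl, send) {
  const requests = [];
  for (const [term, postings] of index) {
    for (const posting of postings) {
      requests.push(
        send(`${baseUrl}/append/${encodeURIComponent(term)}`, {
          method: "PUT",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ data: posting }),
        })
      );
    }
  }
  return requests;
}

// Record the requests instead of sending them, to show their shape.
const sent = [];
const recordingSend = (url, opts) => sent.push({ url, opts });
insertIndex(new Map([["linux", ["arch", "debian"]]]), "http://localhost:8093", recordingSend);
console.log(sent.map((r) => r.url));
// [ 'http://localhost:8093/append/linux', 'http://localhost:8093/append/linux' ]
```

Injecting the sender also keeps the indexing logic testable without a running node.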
API Specification
openapi: 3.1.0
info:
  title: P2P framework
  version: 0.2
paths:
  /:key:
    get:
      summary: Get an entry from the hash table
      description: The stored value is an array
      parameters:
        - name: key
          in: path
      responses:
        "200":
          description: OK
          content:
            schema:
              format:
                error: "boolean"
                key: "string"
                value: "any[]"
              example:
                error: false
                key: "Grundgesetz"
                value: [ "postingList1", "postingList2" ]
        "400":
          description: User error
          content:
            schema:
              format:
                error: "boolean"
                errorMsg: "string"
              example:
                error: true
                errorMsg: "No key given"
  /append/:key:
    put:
      summary: Append to an entry in the hash table
      description: Entries are arrays, to which you can append a single value through this route
      parameters:
        - name: key
          in: path
        - name: data
          in: body
          schema:
            format:
              data: "any"
            example:
              data: { id: 123 }
      responses:
        "200":
          description: OK
          content:
            schema:
              format:
                error: "boolean"
                key: "string"
              example:
                error: false
                key: "Grundgesetz"
        "400":
          description: User error
          content:
            schema:
              format:
                error: "boolean"
                errorMsg: "string"
              example:
                error: true
                errorMsg: "missing data"
  /merge/:key:
    put:
      summary: Merge with the array in the hash table
      description: Entries are arrays, to which you can add multiple values through this route by wrapping them in an array
      parameters:
        - name: key
          in: path
        - name: data
          in: body
          schema:
            format:
              data: "any[]"
            example:
              data: [ { id: 123 }, { id: 124 } ]
      responses:
        "200":
          description: OK
          content:
            schema:
              format:
                error: "boolean"
                key: "string"
              example:
                error: false
                key: "Grundgesetz"
        "400":
          description: User error
          content:
            schema:
              format:
                error: "boolean"
                errorMsg: "string"
              example:
                error: true
                errorMsg: "missing data"
  /batch-get:
    post:
      summary: Get multiple entries of the hash table
      description: Returns a per-key result for each requested key
      parameters:
        - name: keys
          in: body
          schema:
            format:
              data: "string[]"
      responses:
        "200":
          description: OK
          content:
            schema:
              format:
                error: "boolean"
                keys: "string[]"
                values: "object"
              example:
                error: false
                keys: [ "Grundgesetz", "Merkel" ]
                values: {
                  "Grundgesetz": {
                    "error": false,
                    "value": [ "postingList1", "postingList2" ]
                  },
                  "Merkel": {
                    "error": true,
                    "errorMsg": "Not found"
                  }
                }
        "400":
          description: User error
          content:
            schema:
              format:
                error: "boolean"
                errorMsg: "string"
              example:
                error: true
                errorMsg: "missing keys"
Also, there is a key managed by the system named _keyset_size, which tracks the number of keysets in the DHT.
API Examples
Setup to run the examples
./spawnnode 1 &
#=> Server running on http://localhost:8091
./spawnnode 2 &
#=> Server running on http://localhost:8092
./spawnnode 3 &
#=> Server running on http://localhost:8093
PUT /append/:key
curl -Ss -X PUT "http://localhost:8093/append/linux" -d '{"data":"arch"}' -H "Content-Type: application/json"
curl -Ss -X PUT "http://localhost:8093/append/linux" -d '{"data":"debian"}' -H "Content-Type: application/json"
{
  "error": false,
  "key": "linux"
}
PUT /merge/:key
curl -Ss -X PUT "http://localhost:8093/merge/linux" -d '{"data":["ubuntu","manjaro"]}' -H "Content-Type: application/json"
{
  "error": false,
  "key": "linux"
}
GET /:key
curl -Ss "http://localhost:8091/linux"
{
  "error": false,
  "key": "linux",
  "value": [
    "arch",
    "debian",
    "ubuntu",
    "manjaro"
  ]
}
POST /batch-get
curl -Ss -X POST "http://localhost:8092/batch-get" -d '{"keys":["linux","windows"]}' -H "Content-Type: application/json"
{
  "error": false,
  "keys": [
    "linux",
    "windows"
  ],
  "values": {
    "linux": {
      "error": false,
      "value": [
        "arch",
        "debian",
        "ubuntu",
        "manjaro"
      ]
    },
    "windows": {
      "error": true,
      "errorMsg": "Not found"
    }
  }
}