DHT - e-learning - Dipartimento di Informatica
Università degli Studi di Pisa, Dipartimento di Informatica
Lesson n.4
DISTRIBUTED HASH TABLES: AN INTRODUCTION
Laura Ricci
7/10/2014

OUTLINE OF THE LESSON
• Distributed Hash Tables (DHT)
  • what is a DHT?
  • what are the main functionalities of a DHT?
    • data addressing
    • routing
    • join/leave
• DHT instances: Chord, CAN, Pastry, Kademlia/KAD
• DHT applications
• Formal tools: modelling Chord routing through Markov chains

SEARCHING IN P2P SYSTEMS
[Figure: Node A holds the information "I" and asks where to store it; Node B wants to look up "I" and asks where to find it. Between them is a distributed system of Internet nodes: 12.5.7.31, berkeley.edu, planet-lab.org, peer-to-peer.info, 89.11.20.15, 95.7.6.10, 86.8.10.18, 7.31.10.25.]
• Main problem of a P2P system: content search
• In Gnutella, content is stored at the peer sharing it
• Because the network has no structure, the main problem is searching: where is the content with the desired characteristics?

SEARCHING IN P2P SYSTEMS
• Node A stores an information I; B wants to find I but does not know where I is located
• How should the distributed system be organized?
• Which mechanisms decide where the information is stored and how it is found?
• Any solution must take into account:
  • system scalability: the communication overhead and the memory needed by each node, as a function of the number of nodes (peers)
  • system adaptability to faults and to frequent changes (churn)

SEARCHING/ADDRESSING
Two strategies for content retrieval in a P2P network:
• searching: guide the search by the values of a set of attributes of the content
• addressing: associate a unique identifier (ID, sometimes called content key) with the content and use the ID to address the content
The mechanism chosen for searching/addressing influences:
• the construction of the overlay
• the way the objects are paired with the peers
• the efficiency of the content search
In file-sharing P2P systems:
• searching: implemented, for instance, through TTL-enhanced flooding. Note that a unique identifier is associated with the content, but it is not used to retrieve the object; the object is retrieved by specifying a set of keywords
• addressing: Distributed Hash Tables

SEARCHING/ADDRESSING
Addressing:
• pair a unique identifier with a content and use the identifier to look up the content
• similar to a URL on the web
• advantages: each object is uniquely identified; efficient object lookup (logarithmic routing)
• disadvantages: ID computation (hash); maintenance of the addressing structure
Searching:
• look for contents by specifying the values of a set of keywords
• Gnutella 0.4 and 0.6
• similar to Google
• advantages: "user friendly", since it requires neither ID computation nor auxiliary structures
• disadvantages: search inefficiency; need to compare whole objects (overhead)

SEARCHING/ADDRESSING
Unstructured overlay:
• no addressing: TTL-enhanced flooding
• no rule to map content to peers
• no rule to define the neighbours of a peer: each peer may choose its neighbours arbitrarily
• this does not imply a complete lack of structure: the network may assume a scale-free, power-law, or small-world structure
Structured overlay:
• defines an addressing mechanism
• the overlay topology follows some rule
• defines a rule to map content to peers
• deterministic routing
• Distributed Hash Tables (DHT)

DHT: MOTIVATIONS
Centralized approach: a server indexing the data
• Search: O(1) ("content is stored in a centralized server")
• Space required: O(N) (N = amount of shared content)
• Bandwidth (connection server/overlay): O(N)
• complex queries may be easily managed
Fully distributed approach: unstructured network
• Search: worst case O(N²) ("each node contacts each of its neighbours")
• possible optimizations: TTL, identifiers to avoid cyclic paths
• Space required: O(1); it does not depend on the number of nodes in the system
• no data structure to route queries (flooding)

DISTRIBUTED HASH TABLES: MOTIVATIONS
Analysis of existing systems
[Figure: plot of communication overhead against memory per node, both axes marked O(1), O(log N), O(N).
• Flooding: communication overhead O(N), memory O(1). Disadvantages: communication overhead, false negatives.
• Centralized server: communication overhead O(1), memory O(N). Disadvantages: memory, CPU and bandwidth required; fault tolerance.
• A question mark at communication overhead O(log N), memory O(log N): does there exist a solution that is a compromise between the two approaches?]

DISTRIBUTED HASH TABLES: MOTIVATIONS
[Figure: the same plot, with the Distributed Hash Table placed at communication overhead O(log N), memory O(log N), between flooding and the centralized server.]
• Scalability: O(log N)
• Avoid false negatives
• Self-organization: the system automatically manages the join of new nodes and the leave of nodes (voluntary or due to faults)

DISTRIBUTED HASH TABLES: GOALS
Main goal is scalability:
• O(log N) hops to look up an information: routing requires O(log N) steps to reach the node storing the information
• O(log N) entries in the routing table of each node
[Figure: ring of nodes with IDs 611, 709, 1008, 1622, 2011, 2207, 2906, 3485, mapped onto the Internet hosts 12.5.7.31, berkeley.edu, planet-lab.org, peer-to-peer.info, 89.11.20.15, 95.7.6.10, 86.8.10.18, 7.31.10.25; H("my data") = 3107.]

DISTRIBUTED HASH TABLES: GOALS
• Self-adapting with respect to faults and to joins and leaves of nodes: content is re-assigned and re-distributed after faults or voluntary leaves from the network
• Balancing content among nodes: important to avoid increasing the search length

HASH TABLES: BASIC CONCEPTS
• hash function: a mapping from a large input domain to a smaller output domain; the input domain is too large to use the input key directly as a vector index
• example: hash(x) = x mod 10, inserting the values 0, 1, 4, 9, 16, 25
• a bounded number of collisions
• search: O(1)
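The hash table example (hash(x) = x mod 10 with the values 0, 1, 4, 9, 16, 25) can be sketched in Python as follows. This is a minimal illustration rather than part of the lecture: the names NUM_BUCKETS, bucket_index, insert and contains are assumptions, and collisions are handled by chaining, which the slides do not specify.

```python
NUM_BUCKETS = 10  # fixed-size table, matching hash(x) = x mod 10


def bucket_index(key: int) -> int:
    """Map a key from a large input domain to one of NUM_BUCKETS buckets."""
    return key % NUM_BUCKETS


def insert(table, key):
    """Insert a key into its bucket (chaining is an assumption of this sketch)."""
    bucket = table[bucket_index(key)]
    if key not in bucket:
        bucket.append(key)


def contains(table, key):
    """Look a key up by inspecting only the bucket it hashes to."""
    return key in table[bucket_index(key)]


table = [[] for _ in range(NUM_BUCKETS)]
for value in (0, 1, 4, 9, 16, 25):
    insert(table, value)
```

With these six values every key lands in a different bucket (0, 1, 4, 9, 6 and 5), so no collision occurs and each lookup touches a single one-element bucket.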
HASH TABLES: BASIC CONCEPTS
• Insertion, deletion, look-up: O(1)
• Hash table: a fixed-size array whose elements are the hash buckets
• Hash function: maps keys to array elements
• Properties of a good hash function: simple to compute; distributes the keys uniformly over the hash table

DISTRIBUTED HASH TABLES: BASIC IDEAS
Distribute the buckets to the peers: Distributed Hash Tables
• collision-resistant hash function
This requires:
• a mechanism to define which peer is responsible for a bucket
• a routing mechanism to find the peer managing a bucket

DISTRIBUTED HASH TABLES
A first solution:
• hash(data) returns the index of a peer
• the hash function depends on the number of peers
• does not work efficiently for inserting and deleting items: the hash table must be completely repartitioned
The DHT solution:
• peers are "hashed" to a very large logical index space
• data is also "hashed" to this space
• peers are given a set of buckets in this space
• all data falling in a segment is mapped to the corresponding peer

DISTRIBUTED HASH TABLES: THE BASIC IDEA
• in a DHT, every node is responsible for managing one or more buckets
• when a node enters or leaves the network, buckets are handed over between nodes
• the nodes communicate with each other to find the node managing a bucket; this requires a scalable and efficient communication mechanism
• all the operations of a classical hash table are supported

DISTRIBUTED HASH TABLES: THE BASIC IDEA
The mechanism used to find the peer managing a given bucket characterizes the DHT.
Typical behaviour:
• a node knows the key of the content it is looking for
• routing leads to the node responsible for the bucket where the key is located
• that node directly sends back the content, or a pointer to the content
Abstraction defined by a DHT:
• stores (key, value) pairs
• given a key, the DHT returns the corresponding value
• the DHT attaches no semantics to the (key, value) pair

DHT: DISTRIBUTED DATA MANAGEMENT
Nodes and data are mapped into the same address space:
• a unique identifier is paired with each peer and with each data item stored in the P2P network
• a single logical space for data and peers
• each node is responsible for managing a portion of the logical address space (one or more buckets)
• the correspondence between data and nodes may change as nodes join and leave

DHT: DISTRIBUTED DATA MANAGEMENT
Data store/search:
• data search = routing towards the node responsible for the data
• each node maintains a routing table, which gives the node a partial view of the system
• key-based routing: routing is guided by the unique identifier (key) of the data being looked up
• false negatives are avoided

STEP 1: DHT ADDRESSING
• the simplest scenario: nodes and content are mapped into a linear address space 0, ..., 2^m - 1
• the logical address space is much larger than the number of objects to store (e.g. m = 160)
• a total order is defined on the address space (modular arithmetic); for instance, the address space may be structured as a logical ring
• data are mapped to logical addresses through a hash function, Hash(String) mod 2^m; for instance Hash("mydata") = 2313

STEP 2: MAPPING NODES/LOGICAL ADDRESSES
• each node is responsible for a contiguous portion of the address space (some buckets)
• data are mapped into the same logical address space as the nodes through the hash function, e.g. Hash(String): H("MyData") = 2313
• examples: hashing the file name or the file content
• each node stores information about the data mapped onto its portion of the address space
• some replication (redundancy) is often introduced

STEP 2: MAPPING NODES/LOGICAL ADDRESSES
• each node is responsible for an interval of identifiers
• overlapping intervals introduce a level of redundancy
• continuous adaptation
• underlay topology and logical overlay are not correlated

STEP 3: DATA STORAGE
• direct storage
• indirect storage

STEP 3: DIRECT STORAGE
• the DHT stores (key, value) pairs
• when a data item is inserted in the DHT, it is stored on the responsible node; in general this is not the node that inserted the data into the DHT
• example: key = H("Data") = 3107; the data is stored on the node managing the portion of the address space that includes address 3107
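The addressing and direct-storage steps can be sketched in Python. This is only an illustration under stated assumptions: the lecture does not fix the rule that assigns an address to a node, so the sketch assumes a Chord-like rule in which the responsible node is the first node ID at or after the address, wrapping around the ring. The names key_to_address, responsible_node and NODE_IDS are hypothetical; the node IDs and the address 3107 are the ones shown in the lecture figures.

```python
import hashlib
from bisect import bisect_left

M = 160  # size in bits of the address space; SHA-1 digests are 160 bits long


def key_to_address(key: str, m: int = M) -> int:
    """Hash a string into the address space 0 .. 2**m - 1 (Hash(String) mod 2^m)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def responsible_node(address: int, node_ids) -> int:
    """Assumed rule: the first node ID at or after `address`, wrapping on the ring."""
    ring = sorted(node_ids)
    i = bisect_left(ring, address)
    return ring[i % len(ring)]  # index len(ring) wraps to the smallest ID


# Node IDs taken from the lecture figures (toy four-digit values).
NODE_IDS = [611, 709, 1008, 1622, 2011, 2207, 2906, 3485]
```

Under the assumed rule, responsible_node(3107, NODE_IDS) returns 3485, and an address larger than every node ID, such as 3500, wraps around to node 611. The four-digit addresses in the slides are small illustrative values; key_to_address with m = 160 returns a full 160-bit integer.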
[Figure: ring of nodes with IDs 611, 709, 1008, 1622, 2011, 2207, 2906, 3485; data D with H_SHA-1("Data") = 3107; host 134.2.11.68.]

STEP 3: INDIRECT STORAGE
• the value may be a reference to the data (e.g. the physical address of the node storing the content)
• the node storing the content may be the node that inserted the content into the system
• a flexible solution, but it needs a further step to access the data
[Figure: the same ring; H_SHA-1("Data") = 3107; the DHT stores the pointer D: 134.2.11.68, while D itself resides on host 134.2.11.68.]

STEP 4: ROUTING
• the search for D may start at an arbitrary node of the DHT and is guided by key = HASH(D)
• each node has a partial view of the other nodes
• next hop: depends on the routing algorithm
• content-based routing: for instance, based on the closeness between the data ID and the IDs of the nodes in the routing table
• value paired with the key: IP address and port of the peer storing D

STEP 5: DATA RETRIEVAL
• content download
• send the IP address and port to the requesting peer
• if indirect addressing is used, the requesting peer downloads the content from a third peer
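Routing and retrieval with indirect storage can be illustrated with a small Python sketch. Everything specific here is an assumption, not the lecture's method: the routing tables are built Chord-style (the lecture defers concrete DHTs to later lessons), the address space is shrunk to 2^12 so that the node IDs of the figures fit, and the names route, lookup, TABLES, INDEX and the port 4000 are hypothetical.

```python
from bisect import bisect_left

M = 12                      # toy address space 0 .. 2**M - 1 (assumption)
SPACE = 2 ** M
NODE_IDS = sorted([611, 709, 1008, 1622, 2011, 2207, 2906, 3485])


def successor(address):
    """First node ID at or after `address`, wrapping around the ring."""
    i = bisect_left(NODE_IDS, address % SPACE)
    return NODE_IDS[i % len(NODE_IDS)]


def distance(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % SPACE


def routing_table(node):
    """Partial view of the system: successors of node + 2**i (Chord-like choice)."""
    return sorted({successor(node + 2 ** i) for i in range(M)} - {node})


TABLES = {n: routing_table(n) for n in NODE_IDS}


def route(start, key):
    """Return the nodes visited, ending at the node responsible for `key`."""
    path, current = [start], start
    while True:
        to_key = distance(current, key)
        if to_key == 0:                       # current node owns the key
            return path
        nearest = min(TABLES[current], key=lambda n: distance(current, n))
        if to_key <= distance(current, nearest):
            path.append(nearest)              # key lies between current and its successor
            return path
        # Next hop: the known node closest to the key without passing it.
        candidates = [n for n in TABLES[current] if distance(current, n) < to_key]
        current = max(candidates, key=lambda n: distance(current, n))
        path.append(current)


# Indirect storage: the responsible node keeps a pointer (IP address, port)
# to the peer holding the content. The port number is made up.
INDEX = {3485: {3107: ("134.2.11.68", 4000)}}


def lookup(start, key):
    """Route to the responsible node and return (pointer, path)."""
    path = route(start, key)
    return INDEX.get(path[-1], {}).get(key), path
```

Starting from node 611, the key 3107 is reached through the path 611, 2906, 3485; the requesting peer then uses the returned pointer to download the content from the peer that stores it, which is the extra step indirect storage requires.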
DHT: LOAD BALANCING
Main reasons for load imbalance:
• in the distribution of address intervals: a node manages a larger portion of the logical address space (solution: use a uniform hash function)
• the address space is uniformly distributed among the nodes, but the addresses managed by a node correspond to a lot of data
• a node handles a lot of queries, because the data paired with the addresses assigned to it are very popular
Load imbalance implies:
• less system robustness
• less scalability
• O(log N) bounds are not guaranteed
Solutions:
• uniform hash functions
• definition of load-balancing algorithms

DHT: JOINS AND LEAVES
Distributed Hash Table:
• peers are hashed to a linear space
• content is hashed according to the search key
• peers store index data in their areas
• when a peer joins, neighbour peers share their areas with the new peer
• when a peer leaves, the neighbours inherit responsibility for the data of the leaving peer
• which peers are the neighbours depends on the DHT

DHT: NODE JOIN
• compute the unique identifier of the node
• contact an arbitrary node of the DHT (bootstrap node)
• find the exact point of the DHT where to join (predecessor and successor nodes)
• assign a portion of the logical address space to the new peer
• copy the assigned key/value pairs (with redundancy)
• insert the node in the DHT (connect it with the proper neighbours)
[Figure: ring of nodes with IDs 611, 709, 1008, 1622, 2011, 2207, 2906, 3485; the node 134.2.11.68 has ID 3485.]

DHT: NODE LEAVE
Voluntary leave of a node:
• its address space is partitioned among the neighbour nodes
• its key/value pairs are copied to the corresponding nodes
• the node is deleted from the routing tables of the other nodes
Node failure:
• if a node suddenly disconnects from the network, all data stored on it are lost unless they are also stored on other nodes: introduce some redundancy (data replication)
• information loss: periodic information refresh
• use alternative/redundant routing paths
• periodically probe the neighbour nodes to check that they are alive; when a fault is detected, update the routing tables

COMPARING DIFFERENT APPROACHES
Approach              Memory for each node   Communication overhead
Central server        O(N)                   O(1)
Pure P2P (flooding)   O(1)                   O(N²)
DHT                   O(log N)               O(log N)
[The table also has the columns Complex queries, False negatives and Robustness; their entries are not preserved in the transcript.]

DHT: API
API to access a DHT:
• content insertion: PUT(key, value)
• content search: GET(key) returns the value
• the interface is common to several DHT systems
[Figure: a distributed application calls Put(Key, Value) and Get(Key), which returns Value, on the Distributed Hash Table layer (CAN, Chord, Pastry, Tapestry, …), which runs on Node 1, Node 2, Node 3, ..., Node N.]

DHT: APPLICATIONS
• DHTs offer a generic distributed service for storing and indexing information
• the value paired with a key may be a file, an IP address, or any other data
• applications exploiting a DHT:
  • DNS implementation: key = host name, value = list of the corresponding IP addresses
  • P2P storage systems, e.g. Freenet, PAST
  • support for higher-level services
  • ……

CONCLUSIONS
DHT properties:
• routing is based on the key (unique identifier)
• keys are uniformly distributed over the DHT nodes: bottleneck avoidance, incremental insertion of keys, fault tolerance
• self-organizing system: simple and efficient organization
• the terms "Structured Peer-to-Peer" and "DHT" are often used as synonyms
• supports several applications: the values paired with the keys depend on the application

DHT: EXISTING SYSTEMS
• Chord: UC Berkeley, MIT
• Pastry: Microsoft Research, Rice University
• Tapestry: UC Berkeley
• CAN: UC Berkeley, ICSI
• P-Grid: EPFL Lausanne
• Kademlia, KAD network of eMule, ...
• Symphony, Viceroy, …
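The PUT/GET interface and the join and voluntary-leave steps can be put together in a toy, single-process Python sketch. It is a simplification under stated assumptions, not a real DHT: there is no network, routing or replication; responsibility follows an assumed Chord-like successor rule; the address space is 2^12; and the class ToyDHT and all of its method names are hypothetical.

```python
import hashlib
from bisect import bisect_left

SPACE = 2 ** 12  # toy address space (assumption)


def address_of(key: str) -> int:
    """Hash a key into the toy address space."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % SPACE


class ToyDHT:
    def __init__(self, node_ids):
        self.store = {n: {} for n in node_ids}   # node ID -> {key: value}

    def _responsible(self, address):
        """Assumed rule: first node ID at or after the address, wrapping."""
        ring = sorted(self.store)
        return ring[bisect_left(ring, address) % len(ring)]

    def put(self, key, value):
        self.store[self._responsible(address_of(key))][key] = value

    def get(self, key):
        return self.store[self._responsible(address_of(key))].get(key)

    def join(self, node_id):
        """A new node takes over part of the interval of its successor."""
        old_owner = self._responsible(node_id)
        self.store[node_id] = {}
        for key in list(self.store[old_owner]):
            if self._responsible(address_of(key)) == node_id:
                self.store[node_id][key] = self.store[old_owner].pop(key)

    def leave(self, node_id):
        """Voluntary leave: hand the pairs over to the nodes now responsible.
        At least one other node must remain in the system."""
        pairs = self.store.pop(node_id)
        for key, value in pairs.items():
            self.put(key, value)


dht = ToyDHT([611, 709, 1008, 1622, 2011, 2207, 2906, 3485])
for i in range(50):
    dht.put(f"file{i}", f"content{i}")
dht.join(3000)     # hypothetical new node ID
dht.leave(3485)
```

After the join and the leave every key is still retrievable with get, because each operation moves the affected key/value pairs to the node that becomes responsible for them. A node failure, by contrast, would lose the pairs in this sketch, which is why the lecture calls for replication.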