DHT - e-learning - Dipartimento di Informatica
Università degli Studi di Pisa
Dipartimento di Informatica
Lesson n.4
DISTRIBUTED HASH TABLES:
AN INTRODUCTION
Laura Ricci
7/10/2014
OUTLINE OF THE LESSON
Distributed Hash Tables (DHT)
what is a DHT?
what are the main functionalities of a DHT?
data addressing
routing
join/leave
DHT instances:
Chord, CAN, Pastry, Kademlia/KAD
DHT applications
Formal tools: modelling Chord routing through Markov chains
SEARCHING IN P2P SYSTEMS
[Figure: Node A has the information "I" and asks where to store it; Node B wants to look up "I" and asks where to find it. Between them lies a distributed system of Internet nodes (12.5.7.31, berkeley.edu, planet-lab.org, peer-to-peer.info, 89.11.20.15, 95.7.6.10, 86.8.10.18, 7.31.10.25).]
Main problem of a P2P system: content search
In Gnutella, content is stored at the peer sharing it
the main problem, due to the lack of network structure, is searching:
where is the content with the desired characteristics?
SEARCHING IN P2P SYSTEMS
A stores an information item I; B wants to find I, but does not know where I is located
How should the distributed system be organized? Which mechanisms decide
where the information is stored and how it is found?
Any solution must take into account:
system scalability: the communication overhead and the memory needed
by each node, as a function of the number of nodes (peers)
system adaptability to faults and to frequent changes (churn)
SEARCHING/ADDRESSING
two strategies for content retrieval in a P2P network
searching: guide the search by the value of a set of attributes of the
content
addressing: associate a unique identifier ID to the content (sometimes
called content key) and exploit the ID to address content
the mechanism chosen for searching/addressing influences:
the construction of the overlay
the way the objects are paired with the peers
the efficiency of the content search
in file sharing P2P systems:
searching: implemented, for instance, through TTL-enhanced flooding
note that a unique identifier is associated with the content, but it is
not exploited for retrieving the object. The object is retrieved by
specifying a set of keywords
addressing: Distributed Hash Table
SEARCHING/ADDRESSING
Addressing:
pair a unique identifier with each content item
exploit the identifier to look up the content
similar to a URL for the web
advantages:
each object is uniquely identified
efficient object location (logarithmic routing)
disadvantages:
ID computation (hash)
maintaining the addressing structure
Searching:
look for content by specifying the values of a set of keywords
Gnutella 0.4 and 0.6
similar to Google
advantages:
"user friendly": it does not require ID computation or auxiliary structures
disadvantages:
search inefficiency
need to compare whole objects: overhead
SEARCHING/ADDRESSING
Unstructured Overlay
no addressing: TTL-enhanced flooding
no mapping rule to map content to peers
no rule to define the neighbours of a peer: each peer may arbitrarily
choose its neighbours
this does not imply a complete lack of structure: the network may still
assume a structure (scale-free, power-law, small-world, ...)
Structured Overlay
define an addressing mechanism
overlay topology follows some rule
define a mapping rule to map content to peers
deterministic routing
Distributed Hash Tables (DHT)
DHT: MOTIVATIONS
Centralized Approach: a server indexing the data
Search: O(1) – “ Content is stored in a centralized server”
Space required: O(N) (N = amount of shared content)
Bandwidth (connection server/overlay): O(N)
complex queries may be easily managed
Fully Distributed Approach: unstructured network
Search: worst case O(N²) - “each node contacts each of its neighbours”
Possible optimizations (TTL, identifiers to avoid cyclic paths)
Space Required: O(1)
does not depend on the number of nodes in the system
no data structure to route queries (flooding)
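A minimal sketch of TTL-limited flooding may help to see where the message cost and the false negatives come from. The overlay, the peer names and the function name are invented for illustration; real Gnutella messages carry more state than this.

```python
from collections import deque

def flood_search(graph, start, has_content, ttl):
    """TTL-limited flooding. Returns (peers holding the content, messages sent)."""
    seen = {start}                      # query identifiers let peers drop duplicates
    queue = deque([(start, ttl)])
    hits, messages = [], 0
    while queue:
        peer, remaining = queue.popleft()
        if has_content(peer):
            hits.append(peer)
        if remaining == 0:
            continue                    # TTL expired: the query is not forwarded
        for neighbour in graph[peer]:
            messages += 1               # every forwarded copy costs one message
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return hits, messages

# Hypothetical 6-peer overlay; peer "F" is the only one sharing the content.
overlay = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D", "F"], "F": ["E"],
}
holds = lambda peer: peer == "F"

near_hits, near_msgs = flood_search(overlay, "A", holds, ttl=2)
far_hits, far_msgs = flood_search(overlay, "A", holds, ttl=4)
```

With a small TTL the query never reaches "F", so the search reports nothing although the content exists (a false negative); a larger TTL finds it at the price of more messages.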
DISTRIBUTED HASH TABLES: MOTIVATIONS
Analysis of Existing Systems
[Figure: communication overhead versus memory per node.
Flooding: O(N) communication overhead, O(1) memory. Disadvantages: communication overhead, false negatives.
Centralized server: O(1) communication overhead, O(N) memory. Disadvantages: memory, CPU and bandwidth required; fault tolerance.
Is there a solution that is a compromise between the two proposals, with O(log N) communication overhead and O(log N) memory?]
DISTRIBUTED HASH TABLES: MOTIVATIONS
Scalability: O(log N)
Avoid false negatives
Self-organization: the system automatically manages
the join of new nodes in the system
leaves (voluntary or due to faults)
[Figure: communication overhead versus memory per node.
Flooding: O(N) communication overhead, O(1) memory. Disadvantages: communication overhead, false negatives.
Centralized server: O(1) communication overhead, O(N) memory. Disadvantages: memory, CPU and bandwidth required; fault tolerance.
Distributed Hash Table: O(log N) communication overhead, O(log N) memory.]
DISTRIBUTED HASH TABLES: GOALS
Main goal is scalability
O(log(N)) hops to look up an information
O(log(N)) entries in the routing table
[Figure: an overlay of nodes with identifiers 611, 709, 1008, 1622, 2011, 2207, 2906 and 3485, mapped onto Internet hosts (12.5.7.31, berkeley.edu, planet-lab.org, peer-to-peer.info, 89.11.20.15, 95.7.6.10, 86.8.10.18, 7.31.10.25), with H("my data") = 3107. Routing requires O(log(N)) steps to reach the node storing the information; the routing table of each node has O(log(N)) entries.]
DISTRIBUTED HASH TABLES: GOALS
Self adapting with respect to faults, join and leave of new nodes
re-assignment and re-distribution of content for faults or voluntary
leaves from the network
Balancing content among nodes
important to avoid the increase of the search length
HASH TABLES: BASIC CONCEPTS
value insertion: 0, 1, 4, 9, 16, 25
hash function
hash(x) = x mod 10
mapping from an input domain of large size
to a smaller output domain
the input domain is too large to
directly exploit the input key as vector
index
a bounded number of collisions
search O(1)
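The example on this slide can be replayed in a few lines. This is only a sketch: the bucket list and the helper names are chosen here for illustration, and collisions are handled by chaining, which the slide does not specify.

```python
def h(x):
    """Hash function from the slide: hash(x) = x mod 10."""
    return x % 10

buckets = [[] for _ in range(10)]       # fixed-size array of hash buckets

for value in [0, 1, 4, 9, 16, 25]:      # values inserted on the slide
    buckets[h(value)].append(value)     # chaining: colliding values share a bucket

def contains(x):
    """Look-up inspects a single bucket, independent of the table size."""
    return x in buckets[h(x)]
```

Here 16 lands in bucket 6 and 25 in bucket 5; a look-up for 26 inspects bucket 6 only and fails, since 26 was never inserted.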
HASH TABLES: BASIC CONCEPTS
Insertion, deletion, look-up: O(1)
Hash Table: a fixed-size array
elements = hash buckets
Hash function: maps keys to array elements
Properties of a good hash function:
simple to compute
a uniform distribution of the keys in the hash table
DISTRIBUTED HASH TABLES: BASIC IDEAS
Distribute the buckets to the peers:
Distributed Hash Tables
collision resistant function
This requires:
a mechanism to define which is
the peer responsible for a bucket
a routing mechanism to find the
peer managing a bucket
DISTRIBUTED HASH TABLES
A first solution
hash(data) returns the index of a peer
the hash function depends on the number of
peers
does not work efficiently when peers are inserted
or deleted: the hash table must be completely
repartitioned
The DHT solution:
peers are „hashed“ to a very big
logical space
index data is also „hashed“ to this space
peers are given a set of buckets (a segment) in this
space
all data falling in a segment is mapped to
the corresponding peer
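The difference between the two solutions can be measured with a small experiment. The sketch below is an illustration under assumptions not stated in the slides: SHA-1 as hash function, a 32-bit logical space, made-up peer and file names, and a segment rule where each key belongs to the first peer at or after its position.

```python
import hashlib
from bisect import bisect_left

M = 32  # assumed size (in bits) of the logical space

def H(text):
    """Hash a string into the logical space 0 .. 2^M - 1."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

def owner_mod(key, peers):
    """First solution: the hash depends on the number of peers."""
    return peers[H(key) % len(peers)]

def owner_ring(key, peers):
    """DHT solution: peers and keys are hashed into the same space."""
    ring = sorted((H(p), p) for p in peers)
    ids = [i for i, _ in ring]
    idx = bisect_left(ids, H(key)) % len(ring)   # wrap around past the last peer
    return ring[idx][1]

def moved_fraction(owner, keys, before, after):
    """Fraction of keys whose responsible peer changes."""
    moved = sum(owner(k, before) != owner(k, after) for k in keys)
    return moved / len(keys)

keys = [f"file-{i}" for i in range(2000)]
peers = [f"peer-{i}" for i in range(10)]
grown = peers + ["peer-10"]                      # one peer joins

frac_mod = moved_fraction(owner_mod, keys, peers, grown)
frac_ring = moved_fraction(owner_ring, keys, peers, grown)
```

With the modulo scheme most keys change peer when a single peer joins, while in the second scheme only the keys falling in the new peer's segment move.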
DISTRIBUTED HASH TABLES: THE BASIC IDEA
in a DHT, every node is responsible
for the management of one or
more buckets
when a node leaves the network, it passes its buckets
to another node; when a node enters, it takes over
buckets from another node
the nodes communicate with each other to locate the node managing
a bucket
this requires a scalable and efficient communication mechanism
all the operations of a classical
hash table are supported
DISTRIBUTED HASH TABLES: THE BASIC IDEA
The mechanism exploited to detect the peer managing a given bucket
characterizes the DHT
The typical behaviour
a node knows the key of the content it is looking for
routing leads to the node responsible for the bucket
where the key is located
the node responsible for that bucket directly sends the
content or a pointer to the content
Abstraction defined by a DHT
stores (key, value) pairs
given a key, the DHT returns the corresponding value
the DHT attaches no semantics to the key/value pair
DHT: DISTRIBUTED DATA MANAGEMENT
Nodes and data are mapped into the same address space
a unique identifier is paired
with each peer
with each data item stored in the P2P network
a single logical space for data and peers
nodes are responsible for the management of a portion of the
logical address space (one or more buckets)
the correspondence between data and nodes may vary due to the
join/leave of nodes
DHT: DISTRIBUTED DATA MANAGEMENT
Data store/search
Data search = routing towards the node responsible for that data
Each node maintains a routing table, which gives the node a partial
view of the system
Key-based routing: routing is guided by the knowledge of the unique
identifier (key) of the data being looked up
False negatives are avoided
STEP 1: DHT ADDRESSING
The simplest scenario: nodes and content are mapped into a linear address
space 0, …, 2^m - 1
The size of the linear logical address space is much larger than the
number of objects to store (e.g., m = 160).
A total order is defined on the address space (mod operations)
For instance, the address space may be structured as a logical ring
Data is mapped to logical addresses through a hash function
Hash(String) mod 2^m, for instance Hash("mydata") = 2313
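A possible way to compute such an address, assuming SHA-1 as the hash function (its 160-bit output matches the example m = 160). The function name is chosen here for illustration, and the value 2313 on the slide is illustrative, so this sketch is not expected to reproduce it.

```python
import hashlib

M = 160  # number of address bits, as in the slide's example

def address(key: str, m: int = M) -> int:
    """Map a string to a logical address in 0 .. 2^m - 1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()   # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") % (2 ** m)

full = address("mydata")           # address in the 160-bit space
small = address("mydata", m=12)    # same key in a toy 12-bit space (0 .. 4095)
```

The same string always yields the same address, so any peer can compute where a content item lives without asking a directory.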
STEP 2: MAPPING NODES/LOGICAL ADDRESSES
Each node is responsible for a contiguous portion of the address space (some
buckets).
Data are mapped into the same logical address space as the nodes, through
the hash function
E.g., Hash(String): H("MyData") = 2313
Examples: hashing of the file name or of its content
Each node stores information related to the data mapped onto its portion of
the address space
Some replication (redundancy) is often introduced
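One common mapping rule, used by Chord, assigns a key to its successor, that is, the first node whose identifier is greater than or equal to the key, wrapping around the ring. The sketch below assumes that rule and a 12-bit space; the node identifiers are borrowed from the figures in these slides, and the actual rule depends on the specific DHT.

```python
from bisect import bisect_left

SPACE = 2 ** 12   # assumed size of the logical address space
NODE_IDS = sorted([611, 709, 1008, 1622, 2011, 2207, 2906, 3485])

def responsible_node(key_id, node_ids=NODE_IDS):
    """Successor rule (an assumption): first node ID >= key, wrapping at the end."""
    i = bisect_left(node_ids, key_id % SPACE)
    return node_ids[i % len(node_ids)]

owner_a = responsible_node(2313)   # address between nodes 2207 and 2906
owner_b = responsible_node(3500)   # address past the last node: wraps around
```

Under this rule node 2906 manages the interval (2207, 2906], so the address 2313 falls to it, and addresses above 3485 wrap around to node 611.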
STEP 2: MAPPING NODES/LOGICAL ADDRESSES
Each node is responsible for an interval of identifiers
Overlapping intervals introduce a level of redundancy
Continuous adaptation
Underlay topology and logical overlay are not correlated
STEP 3: DATA STORAGE
Direct Storage
Indirect Storage
STEP 3: DIRECT STORAGE
The DHT stores (key, value) pairs
When the data is inserted into the DHT, it is stored on the responsible node
in general, this is not the node which has inserted the data into
the DHT
An example: key = H("Data") = 3107. The data is stored on the node which
manages the portion of the address space including the address 3107.
[Figure: direct storage. An overlay of nodes with identifiers 611, 709, 1008, 1622, 2011, 2207, 2906 and 3485; the data item D, with H_SHA-1("Data") = 3107, and the host 134.2.11.68.]
STEP 3: INDIRECT STORAGE
The value may be a reference to the data (e.g., the physical address of the node
storing the content)
The node storing the content may be the node which has inserted the
content into the system
A flexible solution, but it needs a further step to access the data
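The two storage options can be contrasted in a short sketch. The dictionaries stand in for the table of the responsible node and for the publisher's local files; the payload and the function names are made up, while the key 3107 and the address 134.2.11.68 come from the slides' example.

```python
KEY = 3107                                   # H("Data") in the slides' example
hosts = {"134.2.11.68": {KEY: b"payload"}}   # content kept by its publisher (made-up payload)

direct_store = {}     # table of the responsible node under direct storage
indirect_store = {}   # table of the responsible node under indirect storage

direct_store[KEY] = hosts["134.2.11.68"][KEY]   # direct: copy the data itself
indirect_store[KEY] = "134.2.11.68"             # indirect: store only a reference

def get_direct(key):
    """One access: the responsible node returns the content."""
    return direct_store[key], 1

def get_indirect(key):
    """Two accesses: resolve the reference, then fetch from the holder."""
    holder = indirect_store[key]          # step 1: ask the responsible node
    return hosts[holder][key], 2          # step 2: contact the peer holding the data
```

Both calls return the same content; the indirect variant pays one extra access but leaves the data on the peer that published it.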
[Figure: indirect storage. An overlay of nodes with identifiers 611, 709, 1008, 1622, 2011, 2207, 2906 and 3485, with H_SHA-1("Data") = 3107; the responsible node stores the reference D: 134.2.11.68, while D itself stays on the host 134.2.11.68.]
STEP 4: ROUTING
the search for D may start at an arbitrary node of the DHT and is guided by
key = HASH(D)
each node has a partial view of the other nodes
Next hop: depends on the routing algorithm
content-based routing: for instance, based on the closeness between the
data ID and the IDs of the nodes in the routing table
Value paired with the key: IP address + port of the peer storing D.
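As an illustration of key-based routing, the sketch below forwards a query greedily on a ring. It assumes Chord-style routing tables (the successors of node + 2^i), a 12-bit space and the node identifiers of the figures; for brevity it uses global knowledge of the node set to build the tables and to recognise the responsible node, which a real DHT would do with local information only.

```python
from bisect import bisect_left

M = 12                      # assumed number of address bits
SPACE = 2 ** M
NODES = sorted([611, 709, 1008, 1622, 2011, 2207, 2906, 3485])

def successor(ident):
    """First node at or after ident, wrapping around the ring."""
    i = bisect_left(NODES, ident % SPACE)
    return NODES[i % len(NODES)]

def routing_table(node):
    """Partial view of a node: successor(node + 2^i) for i = 0 .. M-1."""
    return [successor(node + 2 ** i) for i in range(M)]

def distance(a, b):
    """Clockwise distance from a to b."""
    return (b - a) % SPACE

def route(start, key):
    """Greedy routing: jump as far as possible without passing the target."""
    target = successor(key)             # node responsible for the key
    path, current = [start], start
    while current != target:
        candidates = [n for n in routing_table(current)
                      if 0 < distance(current, n) <= distance(current, target)]
        current = max(candidates, key=lambda n: distance(current, n))
        path.append(current)
    return path

path = route(611, 3107)
```

Starting at node 611, the query for key 3107 first jumps to node 2906 and then reaches node 3485, the node responsible for the key under the successor rule assumed here.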
STEP 5: DATA RETRIEVAL
Content Download
Send IP address and port to the requesting peer
If indirect addressing is exploited, the requesting peer downloads the
content from a third peer.
DHT: LOAD BALANCING
Main reasons for load imbalance in the distribution of address intervals:
a node manages a bigger portion of the logical address space
solution: exploit a uniform hash function
the address space is uniformly distributed among the nodes, but the
addresses managed by a node correspond to a lot of data
a node manages a lot of queries, because the data paired with the
addresses assigned to it are very popular
Load imbalance implies:
less system robustness
less scalability
O(log N) bounds are not guaranteed
Solutions:
Uniform hash functions
Definition of load balancing algorithms
DHT: JOINS AND LEAVES
Distributed Hash Table
peers are hashed to a linear space
content is hashed according to the
search key
peers store index data
in their areas
when a peer joins
neighbour peers share their areas
with the new peer
when a peer leaves
the neighbours inherit the
responsibility for
the data of the leaving peer
Which neighbours are involved depends on the DHT
DHT: NODE JOIN
compute the node unique identifier
contact an arbitrary node of the DHT (bootstrap node)
detect the exact point of the DHT where to join (predecessor and
successor node)
assign a portion of the logical address space to the new peer
copy the assigned Key/value pairs (with redundancy)
insertion in the DHT (connect with the proper neighbours)
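The join steps can be sketched on a toy ring. The class below is an illustration only: it keeps all nodes in one process, assumes the successor rule for assigning keys, and omits bootstrap contact, redundancy and routing-table updates.

```python
from bisect import bisect_left, insort

class Ring:
    """Toy in-process ring; a real DHT spreads this state over the peers."""

    def __init__(self):
        self.ids = []      # sorted node identifiers
        self.store = {}    # node id -> {key id: value}

    def responsible(self, key_id):
        """Successor rule (an assumption): first node ID >= key, wrapping."""
        i = bisect_left(self.ids, key_id)
        return self.ids[i % len(self.ids)]

    def put(self, key_id, value):
        self.store[self.responsible(key_id)][key_id] = value

    def join(self, node_id):
        if not self.ids:                         # first node owns everything
            self.ids.append(node_id)
            self.store[node_id] = {}
            return
        succ = self.responsible(node_id)         # locate the join point
        insort(self.ids, node_id)                # insert between pred and succ
        self.store[node_id] = {}
        for key_id in list(self.store[succ]):    # take over the assigned pairs
            if self.responsible(key_id) == node_id:
                self.store[node_id][key_id] = self.store[succ].pop(key_id)

ring = Ring()
for n in [611, 709, 1008, 1622, 2011, 2207, 2906]:
    ring.join(n)
ring.put(3107, "D")
ring.put(300, "X")
owner_before = ring.responsible(3107)   # wraps around to node 611
ring.join(3485)                         # the new node takes over part of 611's interval
owner_after = ring.responsible(3107)
```

Only the successor of the new node hands over data: key 3107 moves from node 611 to node 3485, while key 300 stays where it was.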
[Figure: the node with ID 3485, running on host 134.2.11.68, joins the overlay of nodes with identifiers 611, 709, 1008, 1622, 2011, 2207 and 2906.]
DHT: NODE LEAVE
Voluntary leave of a node
partitioning of its address space among the neighbour nodes
copy of the key/value pairs to the corresponding nodes
deletion of the node from the routing tables of the other nodes
Node failure
If a node suddenly disconnects from the network, all data stored on it are
lost unless they are also stored on other nodes
introduce some redundancy (data replication)
information loss: periodic information refresh
Exploit alternative/redundant routing paths
periodic probing of the neighbour nodes to detect whether they are alive.
When a fault is detected, update the routing tables.
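The difference between a voluntary leave and a failure, and the effect of replication, can be sketched as follows. The helper names, the three-node ring and the policy of replicating on the next successors are assumptions made for this example.

```python
from bisect import bisect_left

def put(ids, store, key, value, replicas=1):
    """Store on the responsible node and on the next replicas-1 successors."""
    i = bisect_left(ids, key)
    for r in range(min(replicas, len(ids))):
        store[ids[(i + r) % len(ids)]][key] = value

def leave(ids, store, node_id):
    """Voluntary leave: hand the key/value pairs to the successor first."""
    remaining = [n for n in ids if n != node_id]
    heir = remaining[bisect_left(remaining, node_id) % len(remaining)]
    store[heir].update(store.pop(node_id))
    return remaining

def fail(ids, store, node_id):
    """Sudden failure: the pairs held by the node disappear with it."""
    store.pop(node_id)
    return [n for n in ids if n != node_id]

ids = [611, 2011, 3485]
store = {n: {} for n in ids}
put(ids, store, 1700, "E", replicas=1)    # only on node 2011
put(ids, store, 3107, "D", replicas=2)    # on node 3485 and on its successor 611

ids = leave(ids, store, 2011)             # 2011 hands key 1700 to node 3485
after_leave = dict(store[3485])
ids = fail(ids, store, 3485)              # 3485 crashes
survivors = {k for table in store.values() for k in table}
```

Key 3107 survives the crash because a replica exists on node 611, whereas key 1700, stored without replication, is lost.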
COMPARING DIFFERENT APPROACHES
Approach              Memory for each node   Communication overhead
Central server        O(N)                   O(1)
Pure P2P (flooding)   O(1)                   O(N²)
DHT                   O(log N)               O(log N)
Further columns: Complex queries, False negatives, Robustness (entries not recoverable)
DHT: API
API to access a DHT
content insertion: PUT(key, value)
content search: GET(key), which replies with the value
The interface is common to several DHT systems
[Figure: a Distributed Application calls Put(Key, Value) and Get(Key), which returns Value, on a Distributed Hash Table (CAN, Chord, Pastry, Tapestry, …) spanning Node 1, Node 2, Node 3, ..., Node N.]
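A minimal in-process sketch of the PUT/GET interface. The class name, the 16-bit space, the node names and the successor-based placement are assumptions for illustration; none of the listed systems is implemented this way, since there the tables live on different machines and `_owner` is resolved by routing.

```python
import hashlib
from bisect import bisect_left

class ToyDHT:
    """Hypothetical single-process stand-in for a DHT."""

    def __init__(self, node_names, m=16):
        self.m = m
        self.nodes = sorted((self._hash(n), n) for n in node_names)
        self.tables = {name: {} for name in node_names}   # one table per node

    def _hash(self, text):
        return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** self.m)

    def _owner(self, key):
        """Node responsible for the key (successor rule, an assumption)."""
        ids = [i for i, _ in self.nodes]
        idx = bisect_left(ids, self._hash(key)) % len(ids)
        return self.nodes[idx][1]

    def put(self, key, value):
        self.tables[self._owner(key)][key] = value

    def get(self, key):
        return self.tables[self._owner(key)].get(key)

dht = ToyDHT(["node-1", "node-2", "node-3", "node-4"])
dht.put("www.example.org", ["192.0.2.10"])     # DNS-like use: host name -> addresses
addresses = dht.get("www.example.org")
missing = dht.get("unknown.example.org")       # absent key: None
```

The application sees only `put` and `get`; which node stores the pair is decided inside the DHT.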
DHT: APPLICATIONS
DHTs offer a generic distributed service for information storing and indexing
The value paired with a key may be
a file
an IP address
or any other data…
Applications exploiting a DHT
DNS implementation
key: host name, value: list of corresponding IP addresses
P2P storage systems, for example Freenet, PAST
Support for higher-level services
…
CONCLUSIONS
DHT Properties
routing is based on the key (unique identifier)
keys are uniformly distributed among the DHT nodes
bottleneck avoidance
incremental insertion of the keys
fault tolerance
self-organizing system
simple and efficient organization
the terms “Structured Peer-to-Peer“ and “DHT“ are often used as
synonyms
Supports several applications
The values paired with the keys depend on the application
DHT: EXISTING SYSTEMS
Chord: UC Berkeley, MIT
Pastry: Microsoft Research, Rice University
Tapestry: UC Berkeley
CAN: UC Berkeley, ICSI
P-Grid: EPFL Lausanne
Kademlia, KAD network of eMule, ...
Symphony, Viceroy, …