ABSTRACT
Peer-to-peer computing consists of an open-ended network of distributed computational peers, where each peer shares data and services with a set of other peers, called its acquaintances. The peer-to-peer paradigm was initially popularized by file-sharing systems such as Napster and Gnutella, but its basic ideas and principles have now found their way into more critical and complex data-sharing applications like those for electronic medical records and scientific data. In such environments, data sharing poses new challenges mainly due to the lack of centralized control, the transient nature of inter-peer connections, and the limited, ever-changing cooperation among the peers.
In this seminar we present new solutions for data sharing and querying in a peer-to-peer data management system, that is, a peer-to-peer system where each peer manages its own database. The solutions are motivated by the data sharing requirements of independent biological data sources. To support data sharing in such a setting, I propose the use of mapping tables containing pairs of corresponding data values that reside in different peers. I illustrate how automated tools can help manage the tables by checking their consistency and by inferring new tables from existing ones. To support structured querying, I propose a framework in which local user queries are translated, through mapping tables, into a set of queries over the acquainted peers. Finally, I present optimization techniques that enable efficient rewriting even over large mapping tables. The proposed mechanisms have been implemented and evaluated experimentally, and they constitute the foundation of a prototype architecture for peer-to-peer data management.
The term “peer-to-peer” (P2P) refers to a class of systems and applications that employ distributed resources to perform a function in a decentralized manner. With the pervasive deployment of computers, P2P is increasingly receiving attention in research, product development, and investment circles. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; and enabling resource aggregation.
This survey reviews the field of P2P systems and applications by summarizing the key concepts and giving an overview of the most important systems. Design and implementation issues of P2P systems are analyzed in general, and then revisited for eight case studies. This survey will help people in the research community and industry understand the potential benefits of P2P. For people unfamiliar with the field it provides a general overview, as well as detailed case studies. The comparison of P2P solutions with alternative architectures is intended for users, developers, and system administrators (IT).
1. Introduction
1.1. Introduction
Peer-to-Peer (P2P) computing is a very controversial topic. Many experts believe that there is not much new in P2P. There is a lot of confusion: what really constitutes P2P? For example, is distributed computing really P2P or not? We believe that P2P does warrant a thorough analysis. The goals of the paper are threefold: 1) to understand what P2P is and what it is not, as well as what is new; 2) to offer a thorough analysis of, and examples of, P2P computing; and 3) to analyze the potential of P2P computing.
The term “peer-to-peer” refers to a class of systems and applications that employ distributed resources to perform a function in a decentralized manner. The resources encompass computing power, data (storage and content), network bandwidth, and presence (computers, human, and other resources). The critical function can be distributed computing, data/content sharing, communication and collaboration, or platform services. Decentralization may apply to algorithms, data, and meta-data, or to all of them. This does not preclude retaining centralization in some parts of the systems and applications. Typical P2P systems reside on the edge of the Internet or in ad-hoc networks. P2P enables:
• Valuable externalities: by aggregating resources through low-cost interoperability, the whole is made greater than the sum of its parts
• Lower cost of ownership and cost sharing: by using existing infrastructure and by eliminating or distributing the maintenance costs
• Anonymity/privacy: by incorporating these requirements in the design and algorithms of P2P systems and applications, and by allowing peers a greater degree of autonomous control over their data and resources
However, P2P also raises some security concerns for users and accountability concerns for IT. In general it is still a technology in development, where it is hard to distinguish the useful from the hype and the new from the old. In the rest of the paper we evaluate these observations in general as well as for specific P2P systems and applications.
P2P gained visibility with Napster’s support for music sharing on the Web [Napster 2001] and its lawsuit with the music companies. However, it is increasingly becoming an important technique in various areas, such as distributed and collaborative computing both on the Web and in ad-hoc networks. P2P has received the attention of both industry and academia. Some big industrial efforts include the P2P Working Group, led by many industrial partners such as Intel, HP, Sony, and a number of startup companies; and JXTA, an open-source effort led by Sun. There are already a number of books published [Oram 2000, Barkai 2001, Miller 2001, Moore and Hebeler 2001, Fattah and Fattah 2002], and a number of theses and projects in progress at universities, such as Chord [Stoica et al 2001], OceanStore [Kubiatowicz et al. 2000], PAST [Druschel and Rowstron 2001], CAN [Ratnasamy 2001], and FreeNet [Clark 1999].
Here are several of the definitions of P2P that are being used by the P2P community. The Intel P2P working group defines P2P as “the sharing of computer resources and services by direct exchange between systems” [p2pwg 2001]. David Anderson calls SETI@home and similar P2P projects that do not involve communication “inverted client-server”, emphasizing that the computers at the edge provide power and those in the middle of the network are there only to coordinate them [Anderson 2002]. Alex Weytsel of Aberdeen defines P2P as “the use of devices on the internet periphery in a non-client capacity” [Veytsel 2001]. Clay Shirky of O’Reilly and Associates uses the following definition: “P2P is a class of applications that takes advantage of resources – storage, cycles, content, human presence – available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, P2P nodes must operate outside the DNS system and have significant or total autonomy from central servers” [Shirky 2001]. Finally, Kindberg defines P2P systems as those with independent lifetimes [Kindberg 2002].
In our view, P2P is about sharing: giving to and obtaining from a peer community. A peer gives some resources and obtains other resources in return. In the case of Napster, it was about offering music to the rest of the community and getting other music in return. It could be donating resources for a good cause, such as searching for extraterrestrial life or combating cancer, where the benefit is obtaining the satisfaction of helping others. P2P is also a way of implementing systems based on the notion of increasing the decentralization of systems, applications, or simply algorithms. It is based on the principles that the world will be connected and widely distributed and that it will not be possible or desirable to leverage everything off of centralized, administratively managed infrastructures. P2P is a way to leverage vast amounts of computing power, storage, and connectivity from personal computers distributed around the world.
Assuming that “peer” is defined as “like each other,” a P2P system then is one in which autonomous peers depend on other autonomous peers. Peers are autonomous when they are not wholly controlled by each other or by the same authority, e.g., the same user. Peers depend on each other for getting information, computing resources, forwarding requests, etc. which are essential for the functioning of the system as a whole and for the benefit of all peers. As a result of the autonomy of peers, they cannot necessarily trust each other and rely completely on the behavior of other peers, so issues of scale and redundancy become much more important than in traditional centralized or distributed systems.
Figure: Simplified, High-Level View of Peer-to-Peer versus Centralized (Client-Server) Approach.
Conceptually, P2P computing is an alternative to the centralized and client-server models of computing, where there is typically a single server or a small cluster of servers and many clients (see Figure 1). In its purest form, the P2P model has no concept of server; rather, all participants are peers. This concept is not necessarily new. Many earlier distributed systems followed a similar model, such as UUCP [Nowitz 1978] and switched networks [Tanenbaum 1981]. The term P2P is also not new. In one of its simplest forms, it refers to communication among peers. For example, in telephony users talk to each other directly once the connection is established, in computer networks the computers communicate P2P, and in games, such as Doom, players also interact directly. However, a comparison between client-server and P2P computing is significantly more complex and intertwined along many dimensions. The figure below is an attempt to compare some aspects of these two models.
Figure: Peer-to-Peer versus Client-Server. There is no clear border between a client-server and a P2P model. Both models can be built on a spectrum of levels of characteristics (e.g., manageability, configurability), functionality (e.g., lookup versus discovery), organizations (e.g., hierarchy versus mesh), components (e.g., DNS), and protocols (e.g., IP), etc. Furthermore, one model can be built on top of the other or parts of the components can be realized in one or the other model. Finally, both models can execute on different types of platforms (Internet, intranet, etc.) and both can serve as an underlying base for traditional and new applications. Therefore, it should not be a surprise that there is so much confusion about what P2P is and what it is not. It is extremely intertwined with existing technologies [Morgan 2002].
The P2P model is quite broad and could be evaluated from different perspectives. The figure below categorizes the scope of P2P development and deployment. In terms of development, platforms provide an infrastructure to support P2P applications. Additionally, developers are beginning to explore the benefit of implementing various horizontal technologies, such as distributed computing, collaboration, and content-sharing software, using the P2P model rather than more traditional models such as client-server. Applications such as file sharing and messaging software are being deployed in a number of different vertical markets. Section 2 provides a more thorough evaluation of P2P markets and Section 5 describes the horizontal technologies in more detail.
The following three lists, which are summarized in Table 1, are an attempt to define the nature of P2P, what is and is not new in P2P. P2P is concerned with:
Figure: P2P Solutions. P2P can be classified into interoperable P2P platforms, applications of the P2P technology, and vertical P2P applications.
New aspects of P2P include:
Aspects of P2P that are not new include:
2. Core Concepts in Peer-to-Peer Networking
2.1. Introduction
Peer-to-peer (P2P) has become one of the most widely discussed terms in information technology (Schoder, Fischbach, & Teichmann, 2002; Shirky, Truelove, Dornfest, Gonze, & Dougherty, 2001). The term peer-to-peer refers to the concept that in a network of equals (peers) using appropriate information and communication systems, two or more individuals are able to spontaneously collaborate without necessarily needing central coordination (Schoder & Fischbach, 2003). In contrast to client/server networks, P2P networks promise improved scalability, lower cost of ownership, self-organized and decentralized coordination of previously underused or limited resources, greater fault tolerance, and better support for building ad hoc networks. In addition, P2P networks provide opportunities for new user scenarios that could scarcely be implemented using customary approaches.
This chapter is structured as follows: The first paragraph presents an overview of the basic principles of P2P networks. Further on, a framework is introduced which serves to clarify the various perspectives from which P2P networks can be observed: P2P infrastructures, P2P applications, P2P communities. The following paragraphs provide a detailed description of each of the three corresponding levels. First, the main challenges—namely, interoperability and security—of P2P infrastructures, which act as a foundation for the above levels, are discussed. In addition, the most promising projects in that area are highlighted. Second, the fundamental design approaches for implementing P2P applications for the management of resources, such as bandwidth, storage, information, files, and processor cycles, are explained. Finally, socioeconomic phenomena, such as free-riding and trust, which are of importance to P2P communities, are discussed. The chapter concludes with a summary and outlook.
2.2. P2P Networks: Characteristics and a Three-Level Model
The shared provision of distributed resources and services, decentralization, and autonomy are characteristic of P2P networks (M. Miller, 2001; Barkai, 2001; Aberer & Hauswirth, 2002; Schoder & Fischbach, 2002; Schoder et al., 2002; Schollmeier, 2002):
Due to the fact that all components share equal rights and equivalent functions, pure P2P networks represent the reference type of P2P design. Within these structures there is no entity that has a global view of the network (Barkai, 2001, p. 15; Yang & Garcia-Molina, 2001). In hybrid P2P networks, selected functions, such as indexing or authentication, are allocated to a subset of nodes that, as a result, assume the role of a coordinating entity. This type of network architecture combines P2P and client/server principles (Minar, 2001, 2002).
On the basis of these characteristics, P2P can be understood as one of the oldest architectures in the world of telecommunication (Oram, 2001). In this sense, the Usenet, with its discussion groups, and the early Internet, or ARPANET, can be classified as P2P networks. As a result, there are authors who maintain that P2P will lead the Internet back to its origins—to the days when every computer had equal rights in the network (Minar & Hedlund, 2001). Decreasing costs for the increasing availability of processor cycles, bandwidth, and storage, accompanied by the growth of the Internet, have created new fields of application for P2P networks. In the recent past, this has resulted in a dramatic increase in the number of P2P applications and controversial discussions regarding limits and performance, as well as the economic, social, and legal implications of such applications (Schoder et al., 2002; Smith, Clippinger, & Konsynski, 2003). The three-level model presented below, which consists of P2P infrastructures, P2P applications, and P2P communities, resolves the lack of clarity with respect to terminology that currently exists in both theory and practice.
Level 1 represents P2P infrastructures. P2P infrastructures are positioned above existing telecommunication networks, which act as a foundation for all levels. P2P infrastructures provide communication, integration, and translation functions between IT components. They provide services that assist in locating and communicating with peers in the network and in identifying, using, and exchanging resources, as well as in initiating security processes such as authentication and authorization.
Figure: Levels of P2P networks
Level 2 consists of P2P applications that use services of P2P infrastructures. They are geared to enable communication and collaboration of entities in the absence of central control.
Level 3 focuses on social interaction phenomena, in particular, the formation of communities and the dynamics within them. In contrast to Levels 1 and 2, where the term peer essentially refers to technical entities, in Level 3 the significance of the term peer is interpreted in a nontechnical sense (peer as person).
2.3. P2P Infrastructures
The term P2P infrastructure refers to mechanisms and techniques that provide communication, integration, and translation functions between IT components in general, and applications in particular. The core function is the provision of interoperability with the aim of establishing a powerful, integrated P2P infrastructure. This infrastructure acts as a “P2P Service Platform” with standardized APIs and middleware which, in principle, can be used by any application (Schoder & Fischbach, 2002; Shirky et al., 2001; Smith et al., 2003).
Among the services that the P2P infrastructure makes available for the respective applications, security has become particularly significant (Barkai, 2001). Security is currently viewed as the central challenge that has to be resolved if P2P networks are to become interesting for business use (Damker, 2002).
2.3.1 Interoperability
Interoperability refers to the ability of any entity (device or application) to speak to, exchange data with, and be understood by any other entity (Loesgen, n.d.). At present, interoperability between various P2P networks scarcely exists. The developers of P2P systems are confronted with heterogeneous software and hardware environments as well as telecommunication infrastructures with varying latency and bandwidth. Efforts are being made, however, to establish a common infrastructure for P2P applications with standardized interfaces. This is also aimed at shortening development times and simplifying the integration of applications in existing systems (Barkai, 2001; Wiley, 2001). In particular, within the World Wide Web Consortium (W3C) (W3C, 2004) and the Global Grid Forum (GGF, n.d.), discussions are taking place about suitable architectures and protocols to achieve this aim. Candidates for a standardized P2P infrastructure designed to ensure interoperability include JXTA, Magi, Web services, Jabber, and Groove (Baker, Buyya, & Laforenza, 2002; Schoder et al., 2002).
2.3.2 Security
The shared use of resources frequently takes place between peers that do not know each other and, as a result, do not necessarily trust each other. In many cases, the use of P2P applications requires granting third parties access to the resources of an internal system, for example, in order to share files or processor cycles. Opening an information system to communicate with, or grant access to, third parties can have critical side effects. This frequently results in conventional security mechanisms, such as firewall software, being circumvented. A further example is communication via instant messaging software. In this case, communication often takes place without the use of encryption. As a result, the security goal of confidentiality is endangered. Techniques and methods for providing authentication, authorization, availability checks, data integrity, and confidentiality are among the key challenges related to P2P infrastructures (Damker, 2002).
A detailed discussion of the security problems which are specifically related to P2P, as well as prototypical implementations and conceptual designs can be found in Barkai (2001), Damker (2002), Udell, Asthagiri, and Tuvell (2001), Groove Networks (2004), Grid Security (2004), Foster, Kesselman, Tsudic, and Tuecke (1998), and Butler et al. (2000).
2.4 P2P Applications: Resource Management Aspects
In the respective literature, P2P applications are often classified according to the categories of instant messaging, file sharing, grid computing, and collaboration (Schoder & Fischbach, 2002; Shirky et al., 2001). This form of classification has developed over time and fails to make clear distinctions. Today, in many cases the categories can be seen to be merging. For this reason, the structure of the following sections is organized according to resource aspects, which, in our opinion, are better suited to providing an understanding of the basic principles of P2P networks and the way they function. Primary emphasis is placed on providing an overview of possible approaches for coordinating the various types of resources, that is, information, files, bandwidth, storage, and processor cycles, in P2P networks.
2.4.1 Information
The following sections explain the deployment of P2P networks using examples of the exchange and shared use of presence information, of document management, and collaboration.
The use of presence information is interesting for the shared use of processor cycles and in scenarios related to omnipresent computers and information availability (ubiquitous computing). Applications can independently recognize which peers are available to them within a computer grid and determine how intensive computing tasks can be distributed among idle processor cycles of the respective peers. Consequently, in ubiquitous computing environments it is helpful if a mobile device can independently recognize those peers which are available in its environment, for example in order to request Web Services, information, storage or processor cycles. The technological principles of this type of communication are discussed in Wojciechowski and Weinhardt (2002).
In addition to linking distributed data sources, P2P applications can offer services for the aggregation of information and the formation of self-organized P2P knowledge networks. Opencola (Leuf, 2002) was one of the first P2P applications that offered its users the opportunity to gather distributed information in the network from the areas of knowledge that interest them. For this purpose, users create folders on their desktop that are assigned keywords corresponding to their areas of interest. Opencola then searches the knowledge network independently and continuously for available peers that have corresponding or similar areas of knowledge, without being dependent on centrally administered information. Documents from relevant peers are analyzed, suggested to the user as appropriate, and automatically duplicated in the user’s folder. If the user rejects the respective suggestions, the search criteria are corrected. The use of Opencola results in a spontaneous networking of users with similar interests without the need for central control.
2.4.2 Files
File sharing is probably the most widespread P2P application. It is estimated that as much as 70% of the network traffic in the Internet can be attributed to the exchange of files, in particular music files (Stump, 2002) (more than one billion downloads of music files are recorded each week [Oberholzer & Strumpf, 2004]). Characteristic of file sharing is that peers that have downloaded files in the role of a client subsequently make them available to other peers in the role of a server. A central problem for P2P networks in general, and for file sharing in particular, is locating resources (the lookup problem) (Balakrishnan, Kaashoek, Karger, Morris, & Stoica, 2003). In the context of file sharing systems, three different algorithms have emerged: the flooded request model, the centralized directory model, and the document routing model (Milojicic et al., 2002). These can best be illustrated by their prominent implementations—Gnutella, Napster, and Freenet.
P2P networks that are based on the Gnutella protocol function without a central coordination authority. All peers have equal rights within the network. Search requests are routed through the network according to the flooded request model, which means that a search request is passed on to a predetermined number of peers. If they cannot answer the request, they pass it on to other nodes until a predetermined search depth (ttl = time-to-live) has been reached or the requested file has been located. Positive search results are then sent to the requesting entity, which can then download the desired file directly from the entity that is offering it. A detailed description of searches in Gnutella networks, as well as an analysis of the protocol, can be found in Ripeanu, Foster, and Iamnitchi (2002) and Ripeanu (2001). Due to the fact that the effort for the search, measured in messages, increases exponentially with the depth of the search, the inefficiency of simple implementations of this search principle is obvious. In addition, there is no guarantee that a resource will actually be located. Operating subject to certain prerequisites (such as nonrandomly structured networks), numerous prototypical implementations (for example, Aberer et al., 2003; Crowcroft & Pratt, 2002; Dabek et al., 2001; Druschel & Rowstron, 2001; Nejdl et al., 2003; Pandurangan & Upfal, 2001; Ratnasamy, Francis, Handley, Karp, & Shenker, 2001; Lv, Cao, Cohen, Li, & Shenker, 2002; Zhao et al., 2004) demonstrate how searches can be effected more “intelligently” (see, in particular, Druschel, Kaashoek, & Rowstron [2002], and also Aberer & Hauswirth [2002] for a brief overview). The FastTrack protocol enjoys widespread use in this respect. It optimizes search requests by means of central supernodes which, among themselves, form a decentralized network similar to Gnutella.
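As a rough illustration of the flooded request model, the following Python sketch forwards a query to neighbouring peers until a time-to-live counter expires. The peer names, the TTL value, and the data structures are purely illustrative assumptions and do not reflect the actual Gnutella message format.

# Minimal sketch of the flooded request model (Gnutella-style); illustrative only.
class FloodingPeer:
    def __init__(self, name):
        self.name = name
        self.files = set()        # files this peer shares
        self.neighbours = []      # acquainted peers

    def search(self, filename, ttl, seen=None):
        """Flood a query to neighbours until the TTL expires.

        Returns the names of peers offering the file; the requester can
        then download the file directly from any of them.
        """
        seen = seen if seen is not None else set()
        if self.name in seen:          # do not process the same query twice
            return set()
        seen.add(self.name)

        hits = {self.name} if filename in self.files else set()
        if ttl > 0:                    # pass the request on with a decremented TTL
            for peer in self.neighbours:
                hits |= peer.search(filename, ttl - 1, seen)
        return hits

# Usage: three peers in a line; a query with ttl=2 reaches the last one.
a, b, c = FloodingPeer("a"), FloodingPeer("b"), FloodingPeer("c")
a.neighbours, b.neighbours = [b], [c]
c.files.add("song.mp3")
print(a.search("song.mp3", ttl=2))     # {'c'}

The number of messages generated grows quickly with the search depth, which is exactly the inefficiency the text describes.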
With respect to its underlying centralized directory model, the early Napster (Napster, 2000) can be viewed as a nearly perfect example of a hybrid P2P system in which a part of the infrastructure functionality, in this case the index service, is provided centrally by a coordinating entity. The moment a peer logs into the Napster network, the files that the peer has available are registered by the Napster server. When a search request is issued, the Napster server delivers a list of peers that have the desired files available for download. The user can obtain the respective files directly from the peer offering them.
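The centralized directory model can be sketched as a single index that records which peers offer which files, while downloads remain peer-to-peer. The Python sketch below is a toy illustration under that assumption; the class names and addresses are invented and do not correspond to Napster's actual protocol.

# Minimal sketch of the centralized directory model (early-Napster style).
class DirectoryServer:
    """Central index: knows who offers what, but never stores the files."""
    def __init__(self):
        self.index = {}                      # filename -> set of peer addresses

    def register(self, peer_address, filenames):
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_address)

    def unregister(self, peer_address):
        for offers in self.index.values():
            offers.discard(peer_address)

    def lookup(self, filename):
        return self.index.get(filename, set())

# Usage: peers register on login; the requester downloads directly from a peer.
server = DirectoryServer()
server.register("peer1:6699", ["song.mp3", "talk.ogg"])
server.register("peer2:6699", ["song.mp3"])
print(server.lookup("song.mp3"))             # {'peer1:6699', 'peer2:6699'}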
Searching for and storing files within the Freenet network (Clarke, Miller, Hong, Sandberg, & Wiley, 2002; Clarke, 2003) takes place via the so-called document routing model (Milojicic et al., 2002). A significant difference from the models that have been introduced so far is that files are not stored on the hard disk of the peers providing them, but are intentionally stored at other locations in the network. The reason behind this is that Freenet was developed with the aim of creating a network in which information can be stored and accessed anonymously. Among other things, this requires that the owner of a network node does not know what documents are stored on his/her local hard disk. For this reason, files and peers are allocated unambiguous identification numbers. When a file is created, it is transmitted, via neighboring peers, to the peer with the identification number that is numerically closest to the identification number of the file and is stored there. The peers that participate in forwarding the file save the identification number of the file and also note the neighboring peer to which they have transferred it in a routing table to be used for subsequent search requests.
Files are located by forwarding search queries on the basis of the information in the routing tables of the individual peers. In contrast to networks that operate according to the flooded request model, when a requested file is located, it is transmitted back to the peer requesting it via the same path. In some applications, each node on this route stores a replicate of the file in order to be able to process future search queries more quickly. In this process, the peers only store files up to a maximum capacity. When their storage is exhausted, files are deleted according to the least recently used principle. This results in a correspondingly large number of replicates of popular files being created in the network, whereas, over time, files that are requested less often are removed (Milojicic et al., 2002). In various studies (Milojicic et al., 2002), the document routing model has been proven suitable for use in large communities. The search process, however, is more complex than, for example, in the flooded request model. In addition, it can result in the formation of islands—that is, a partitioning of the network in which the individual communities no longer have a connection to the entire network (Clarke et al., 2002; Langley, 2001).
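The following Python sketch is a much-simplified illustration of the document routing idea: identifiers are derived with a hash function, a file is forwarded towards the peer whose identifier is numerically closest, replicates are cached along the path, and the least recently used file is evicted when a peer's storage is full. The greedy routing shown here does not guarantee that every lookup converges; real systems rely on carefully structured routing tables, so treat this only as an illustration of the principle.

# Minimal sketch of the document routing model; not the actual Freenet protocol.
import hashlib
from collections import OrderedDict

def make_id(text, bits=16):
    """Derive a numeric identifier from a name via a hash function."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (1 << bits)

class RoutingPeer:
    def __init__(self, name, capacity=3):
        self.id = make_id(name)
        self.neighbours = []                      # routing candidates
        self.store = OrderedDict()                # file_id -> data, in LRU order
        self.capacity = capacity

    def _cache(self, file_id, data):
        """Keep a replicate; evict the least recently used file when full."""
        self.store[file_id] = data
        self.store.move_to_end(file_id)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)

    def closest_neighbour(self, file_id):
        return min(self.neighbours, key=lambda p: abs(p.id - file_id), default=None)

    def put(self, file_id, data):
        """Forward the file towards the peer whose id is numerically closest."""
        nxt = self.closest_neighbour(file_id)
        if nxt is not None and abs(nxt.id - file_id) < abs(self.id - file_id):
            nxt.put(file_id, data)
        self._cache(file_id, data)                # replicate along the path

    def get(self, file_id):
        if file_id in self.store:
            self.store.move_to_end(file_id)
            return self.store[file_id]
        nxt = self.closest_neighbour(file_id)
        if nxt is None or abs(nxt.id - file_id) >= abs(self.id - file_id):
            return None                           # no closer peer to ask
        data = nxt.get(file_id)
        if data is not None:
            self._cache(file_id, data)            # replicate on the return path
        return data

# Usage sketch: peer "a" inserts a document, which is routed towards the peer
# whose identifier is numerically closest and replicated along the way.
a, b, c = RoutingPeer("a"), RoutingPeer("b"), RoutingPeer("c")
a.neighbours, b.neighbours, c.neighbours = [b], [a, c], [b]
a.put(make_id("report.pdf"), b"...contents...")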
2.4.3 Bandwidth
Due to the fact that the demands on the transmission capacities of networks are continuously rising, in particular due to the increase in large-volume multimedia data, effective use of bandwidth is becoming more and more important. Currently, in most cases, centralized approaches in which files are held on the server of an information provider and transferred from there to the requesting client are primarily used. In this case, a problem arises when spontaneous increases in demand exert a negative influence on the availability of the files due to the fact that bottlenecks and queues develop.
Without incurring any significant additional administration, P2P-based approaches achieve increased load balancing by taking advantage of transmission routes that are not being fully exploited. They also facilitate the shared use of the bandwidth provided by the information providers.
2.4.4 Storage
In order to participate in a P2P storage network, each peer receives a public/private key pair. With the aid of a hash function, the public key is used to create an unambiguous identification number for each peer. In order to gain access to storage on another computer, the peer has to either make available some of its own storage or pay a fee. Corresponding to its contribution, each peer is assigned a maximum volume of data that it can add to the network. When a file is to be stored in the network, it is assigned an unambiguous identification number, created with a hash function from the name or the content of the respective file, as well as the public key of the owner. Storing the file and searching for it in the network take place in the manner described above for the document routing model. In addition, a freely determined number of file replicates are also stored. Each peer maintains its own current version of the routing table, which is used for storage and searches. The peer checks the availability of its neighbors at set intervals in order to establish which peers have left the network. In this way, new peers that have joined the network are also included in the table.
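A minimal sketch of the identifier and quota handling described above might look as follows. The key generation is mocked with random bytes rather than a real public/private key pair, and all names and sizes are illustrative assumptions.

# Sketch of identifier and quota handling in a P2P storage network; illustrative only.
import hashlib
import os

def hash_id(data):
    """Unambiguous identifier derived with a hash function."""
    return hashlib.sha256(data).hexdigest()

class StoragePeer:
    def __init__(self, contributed_bytes):
        self.public_key = os.urandom(32)          # stand-in for a real public key
        self.peer_id = hash_id(self.public_key)   # peer id = hash(public key)
        self.quota = contributed_bytes            # volume the peer may add to the network
        self.used = 0

    def file_id(self, filename, content):
        # file id = hash(name and content, plus the owner's public key)
        return hash_id(filename.encode() + content + self.public_key)

    def account_for(self, content):
        if self.used + len(content) > self.quota:
            raise RuntimeError("contribution exhausted; contribute storage or pay a fee")
        self.used += len(content)

# Usage: a peer that contributed 1 MB may insert up to that amount.
peer = StoragePeer(contributed_bytes=1_000_000)
data = b"measurement series 42"
fid = peer.file_id("series42.dat", data)
peer.account_for(data)
print(peer.peer_id[:12], fid[:12], peer.used)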
2.4.5 Processor Cycles
Recognition that the available computing power of networked entities often goes unused was an early incentive for using P2P applications to bundle computing power. At the same time, the requirement for high-performance computing, for example for computing operations in the fields of bio-informatics, logistics, or the financial sector, has been increasing. By using P2P applications to bundle processor cycles, it is possible to achieve computing power that even the most expensive supercomputers can scarcely provide. This is effected by forming a cluster of independent, networked computers in which the individual computer is transparent and all networked nodes are combined into a single logical computer. The respective approaches to the coordinated release and shared use of distributed computing resources in dynamic, virtual organizations that extend beyond any single institution currently fall under the term grid computing (Baker et al., 2002; Foster, 2002; Foster & Kesselman, 2004; Foster, Kesselman, & Tuecke, 2002; GGF, n.d.). The term grid computing is an analogy to customary power grids. The greatest possible amount of resources, in particular computing power, should be available to the user, ideally unrestricted and not bound to any location—similar to the way in which power is drawn from an electricity socket. The collected works of Bal, Löhr, and Reinefeld (2002) provide an overview of diverse aspects of grid computing.
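As a toy illustration of bundling processor cycles, the sketch below splits a job into independent work units and hands them to a pool of "idle peers" modelled as worker threads. A real grid or @home system would ship the units to remote machines and collect the results asynchronously; the function names and pool size here are purely assumptions.

# Sketch of distributing work units across idle peers; illustration of the principle only.
from concurrent.futures import ThreadPoolExecutor

def work_unit(chunk):
    """Stand-in for a CPU-intensive task (e.g. scoring one parameter set)."""
    return sum(x * x for x in chunk)

def distribute(chunks, idle_peers=4):
    # Each "peer" is modelled as a worker thread; a real system would send the
    # work unit to a remote machine instead.
    with ThreadPoolExecutor(max_workers=idle_peers) as pool:
        return list(pool.map(work_unit, chunks))

# Usage: split a large job into independent chunks and aggregate the results.
chunks = [range(i * 1000, (i + 1) * 1000) for i in range(16)]
print(sum(distribute(chunks)))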
3. The Peer-to-Peer Architecture
3.1. Introduction
In this chapter, we look at the general architecture of a peer-to-peer system and contrast it with the traditional client-server architecture that is ubiquitous in current computing systems. We then compare the relative merits and demerits of each of these approaches toward building a distributed system.
We begin the chapter with a discussion of the client-server and peer-to-peer computing architectures. The subsequent subsections look at the base components that go into making a peer-to-peer application, finally concluding with a section that compares the relative strengths and weaknesses of the two approaches.
3.2. Distributed Applications
A distributed application is an application that contains two or more software modules that are located on different computers. The software modules interact with each other over a communication network connecting the different computers. To build a distributed application, you would need to decide how many software modules to include in the application, how to place these software modules on the different computers in the network, and how each software module discovers the other modules it needs to communicate with. There are many other tasks that must be done to build a distributed application, but those mentioned above are the key tasks to explain the difference between client-server computing and peer-to-peer computing.
3.2.1 A Distributed Computing Example
The different approaches to distributed computing can be explained best by means of an example. Suppose you are given the task of creating a simulation of the movement of the Sun, the Earth, and the Moon for a team of five astronomers. Each of the five astronomers has a computer on which he or she would like to see the motion and position of the three heavenly bodies at any given time for the last 2000 years as well as the next 2000 years. Let us say (purely for the sake of illustration, rather than as the preferred way to write such a simulation) that the best way to solve this problem is to create a large database of records, each record containing the relative positions of the three bodies at different times ranging over the entire 4000-year period. To show the positions of the three heavenly bodies, the program will find the appropriate set of records and display the position visually on the computer screen. Even after making this choice on how to write the program, you, as the programmer assigned to the task, have multiple ways to develop and deploy the software.
In real life, one could also use a hybrid approach that is a mixture between the client-server architecture and the peer-to-peer architecture. The hybrid approach places some software modules on a set of computers that act as servers and other modules on computers that act as clients. For some distributed applications, the hybrid approach can often result in a better trade-off between the ease of software maintenance, scalability, and reliability.
For any of the approaches selected, you would need to solve the discovery problem. The different modules of the application need to communicate with each other, and a prerequisite for this would be that the modules know where to send messages to the other modules. In the Internet, messages are sent to other applications by specifying their network address, which consists of the IP address of the application and the port numbers on which the application is receiving messages. To communicate over the Internet protocol suite, each software module must find out the network address of the other software module (or modules).
One solution to the discovery problem is to fix the port numbers that all the software modules will be using and to have all the modules know the port numbers and IP addresses of the different modules.
One of the key advantages of the client-server architecture (approach II discussed above) is that it makes the discovery process quite simple. This enables the deployment of a large number of clients and a high degree of scalability. Let us now define the client-server architecture and the peer-to-peer computing architecture in a more precise manner and then examine the discovery process in each of the architectures.
3.2.2 Client-Server Architecture
The client-server architecture is a way to structure a distributed application so that it consists of two distinct software modules:
The only communication in the system is between the client modules and the server module.
In the client-server architecture, the server is usually the more complex piece of the software. The clients are often (although not always) simpler. With the wide availability of a web browser on most desktops, it is quite common to develop distributed applications so that they can use a standard web browser as the client. In this case, no effort is needed to develop or maintain the client (or, rather, the effort has been taken over by a third party—the developer of the web browser). This simplifies the task of maintaining and upgrading the application software.
Figure: Client-Server architecture
The solution used for discovery in the client-server architecture is quite simple. The server runs on a port and network address that is known to the client module. The clients connect to the server on this well-known network address. Once the client connects to the server, the client and server are able to communicate with each other. The server need not be configured with any information about the clients. This implies that the same server module can communicate with any number of clients, constrained only by the physical resources needed to provide a reasonable response time to all of the connected clients.
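The well-known-address discovery can be illustrated with a few lines of Python socket code. The host and port below are arbitrary assumptions, and the exchange is reduced to a single greeting.

# Sketch of client-server discovery via a well-known address; illustrative only.
import socket
import threading

HOST, PORT = "127.0.0.1", 8400        # the address every client is configured with

def serve_one(listener):
    conn, _ = listener.accept()        # any client may connect; none is pre-registered
    with conn:
        conn.sendall(b"hello from the well-known server\n")

# Server side: bind the well-known address, then wait for clients.
listener = socket.create_server((HOST, PORT))
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()

# Client side: discovery is trivial because the address is known in advance.
with socket.create_connection((HOST, PORT)) as sock:
    print(sock.recv(1024).decode().strip())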
The simplicity and ease of maintenance of the client-server architecture are the key reasons for its widespread usage in the design of distributed applications at the present time. However, the client-server architecture has one drawback: it does not utilize the computing power of the computers running the client modules as effectively as it does the computing power of the server module.
3.2.3 Peer-to-Peer Architecture
The peer-to-peer architecture is a way to structure a distributed application so that it consists of many identical software modules, each module running on a different computer. The different software modules communicate with each other to complete the processing required for the completion of the distributed application.
One could view the peer-to-peer architecture as placing a server module as well as a client module on each computer. Thus each computer can access services from the software modules on another computer, as well as provide services to the other computers. However, it also implies that the discovery process in the peer-to-peer architecture is much more complicated than that of the client-server architecture. Each computer would need to know the network addresses of the other computers running the distributed application, or at least of that subset of computers with which it may need to communicate. Furthermore, propagating changes to the different software modules on all the different computers would also be much harder. However, the combined processing power of several large computers could easily surpass the processing power available from even the best single computer, and the peer-to-peer architecture could thus result in much more scalable applications.
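To make the dual role concrete, the following sketch gives each peer a listening socket (server role) and a method that contacts the peers whose addresses it knows (client role). The addresses are hard-coded assumptions, standing in for whatever discovery mechanism a real system would use.

# Sketch of a peer that is both client and server; illustrative only.
import socket
import threading

class PeerNode:
    def __init__(self, port, known_peers):
        self.port = port
        self.known_peers = known_peers               # network addresses of other peers
        self.listener = socket.create_server(("127.0.0.1", port))
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:                                   # server role: answer every request
            conn, _ = self.listener.accept()
            with conn:
                conn.sendall(f"pong from {self.port}".encode())

    def ping_all(self):
        replies = []                                  # client role: contact known peers
        for host, port in self.known_peers:
            with socket.create_connection((host, port)) as sock:
                replies.append(sock.recv(1024).decode())
        return replies

# Usage: two peers that know each other's addresses.
p1 = PeerNode(8401, known_peers=[("127.0.0.1", 8402)])
p2 = PeerNode(8402, known_peers=[("127.0.0.1", 8401)])
print(p1.ping_all(), p2.ping_all())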
Figure: Peer-to-Peer Architecture
3.3 The Peer-to-Peer Software Structure
As mentioned above, a distributed application implemented in a peer-to-peer fashion would have the same software module running on all of the participating computers. Given the complexity associated with discovering, communicating with, and managing the large number of computers involved in a distributed system, the software module is typically structured in a layered manner. The software of most peer-to-peer applications can be divided into three layers: the base overlay layer, the middleware layer, and the application layer. The base overlay layer deals with the issue of discovering other participants in the peer-to-peer system and creating a mechanism for all the nodes to communicate with each other. This layer is responsible for ensuring that all the participants in the system are aware of the other participants. Functions provided by the base layer are the minimum functionality required for a peer-to-peer application to run.
The middleware layer includes additional software components that could potentially be reused by many different applications. The term “middleware” is used to refer to software components that are primarily invoked by other software components and used as a supporting infrastructure to build other applications. Functions included in this layer are the ability to create a distributed index for information in the system, a publish-subscribe facility, and security services. The functions provided in the middleware level are not necessary for all applications, but they are developed to be reused by more than one application.
Finally, the application layer provides software packages intended to be used by human users and developed so as to exploit the distributed nature of the peer-to-peer infrastructure.
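A minimal sketch of this three-layer structure is shown below. The class and method names are invented for illustration and only show how each layer builds on the one beneath it.

# Sketch of the three-layer structure of a peer's software module; illustrative only.
class BaseOverlay:
    """Base overlay layer: tracks the other participants in the system."""
    def __init__(self):
        self.participants = set()

    def join(self, peer_id):
        self.participants.add(peer_id)

class DistributedIndex:
    """Middleware layer: a reusable service (here, a toy distributed index)."""
    def __init__(self, overlay):
        self.overlay = overlay
        self.entries = {}

    def publish(self, key, peer_id):
        if peer_id in self.overlay.participants:      # only index known participants
            self.entries.setdefault(key, set()).add(peer_id)

    def locate(self, key):
        return self.entries.get(key, set())

class FileSharingApp:
    """Application layer: what the human user actually interacts with."""
    def __init__(self, index):
        self.index = index

    def share(self, filename, peer_id):
        self.index.publish(filename, peer_id)

    def find(self, filename):
        return self.index.locate(filename)

# Usage: each layer only talks to the layer directly beneath it.
overlay = BaseOverlay()
overlay.join("peer-1")
app = FileSharingApp(DistributedIndex(overlay))
app.share("notes.txt", "peer-1")
print(app.find("notes.txt"))                          # {'peer-1'}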
3.3.1 Base Overlay Layer
As mentioned above, the base overlay formation is a feature that must be provided by all peer-to-peer systems. The functions in this layer include the following:
3.3.2 Middleware Functions
The middleware layer is responsible for providing some common functions that will be used by applications at the higher layer. The middleware consists of those software functions that are intended to be used primarily by other software components, rather than by a human user. The middleware function in itself cannot be used to build a complete application, but the common functions can be used to build peer-to-peer applications rapidly.
Some of the functions included in this layer are:
3.3.3 Application Layer
We define this layer as consisting of the software components that are designed to be used primarily by a human user. The file-sharing application is the most ubiquitous peer-to-peer application, with multiple implementations available from a large number of providers. The file-sharing application allows users of a peer-to-peer network to find files of interest from other computers on the network and to download them locally. The use of this application for sharing copyrighted content has been the subject of several legal cases between developers of peer-to-peer software and the music industry.
File sharing, however, is not the only application that can exploit the properties of a distributed base overlay infrastructure. A peer-to-peer infrastructure can be used to support self-managing websites, assist users to surf the network in an anonymous manner, and provide highly scalable instant messaging services and a host of other common applications.
There are some old applications which were built and developed with the peer-to-peer model long before the file-sharing application grew in prominence. These applications include some routing protocols used within the Internet infrastructure as well as the programs used to provide discussion and distributed newsgroups on the Internet.
4. Comparison of Architectures
4.1. Introduction
If you had to implement an application and had the choice of implementing it with a peer-to-peer architecture or a client-server architecture, which one would you pick? Either of the two approaches to building the application can be made to work in most cases. In this section, we look at some of the issues you should consider when deciding which of the two approaches would be more appropriate for the task at hand. Each of the subsections discusses some of the issues you may want to consider and the merits and demerits of the two architectures compared with each other.
4.2 Ease of Development
When building an application, you need to consider how easy or difficult it will be to build and test the software for the application. The task of developing the software is eased by the existence of development and debugging tools that can be used to hasten the task of developing the application.
For developing client-server applications, there are a large number of application programs that are available to ease the task of development. Many software components, such as web servers, web-application servers, and messaging software, are available from several vendors and provide infrastructure that can be readily used to provide a server-centric solution.
4.3 Manageability
Manageability refers to the ease of managing the software once it is deployed. After a software application is up and ready, it still needs ongoing maintenance while in operation. Maintenance includes tasks such as ensuring that the application has not stopped working (and restarting it in case it stops working), making backup copies of the data generated by the application, applying software upgrades, fixing any bugs that are discovered, educating users about the application, and a variety of other functions.
In a peer-to-peer application, the application is running on several different machines that could be distributed across a wide geographic area. If they are all under a single administrative domain, it is possible to standardize on a common platform for all of them. However, it is more common to find the situation in which the different components of a peer-to-peer application would run on different platforms. This makes the manageability of peer-to-peer applications much harder.
4.4 Scalability
The scalability of an application is measured in terms of the highest rate or size of user-level interactions that the application can support with a reasonable performance. The quantity in which scalability is measured is determined by the type of application. The scalability of a web server can be measured by the number of user requests it can support per second; the scalability of a directory server can be measured by the number of operations it can support per second, as well as by the maximum number of records it can store while maintaining a reasonable performance, for example, the maximum number of records it can store while keeping the time for a lookup operation below 100 ms.
Peer-to-peer applications use many computers to solve a problem and thus are likely to provide a more scalable solution than a server-centric solution, which relies on a single computer to perform the equivalent task. In general, using multiple computers would tend to improve the scalability of the application compared with using only a single computer.
The scalability of client-server computing as well as peer-to-peer computing has been proven by experience. Web servers of popular websites such as cnn.com or yahoo.com can handle millions of requests each day on a routine basis. Similarly, the number of files exchanged on existing peer-to-peer networks such as Gnutella and Kazaa is on the order of millions of files every day. However, there is one difference between the scalability of the centralized server solution and the peer-to-peer solution.
4.5 Administrative Domains
One of the key factors determining how to structure the application is the usage pattern of the application and how the different computers that are used to deploy the application software are going to be administered. In general, with a client-server approach, the server computers need to be under a single administrative domain.
A peer-to-peer system, however, can often be created by using computers from many different administrative domains. Thus, if usage of the software requires that computers from many different administrative domains be used, the peer-to-peer approach would be the natural choice for that application.
4.6 Security
Once an application has been deployed, one of the administrative tasks associated with it is to manage its security. Security management entails the tasks of making sure that the system is only accessed by authorized users, that user credentials are authenticated, and that malicious users of the system do not plant viruses or Trojan horses on the system.
Security issues and vulnerabilities have been studied comprehensively in server-centric solutions, and safeguards against the most common types of security attacks have been developed. In general, the security of a centralized system can be managed much more readily than the security of a distributed infrastructure. In a distributed infrastructure, the security apparatus and mechanisms must be replicated at multiple sites as opposed to a single site. This increases the cost of providing the security infrastructure.
4.7 Reliability
The reliability of a system is measured by its ability to continue working when one or more of its components fail. In the context of computer systems, reliability is often measured in terms of the ability of the system to run when one or more computers hosting the system are brought down. The approach used for reliability in most computers is to provide for redundant components, having multiple computers do the task instead of a single computer, such as having standbys that can be activated when a primary computer fails.
High-reliability computer applications can be developed by using either client-server or peer-to-peer architectures. The solution for scalability for high-volume servers also provides for increased reliability and continued operation in case one of the servers fails. Distributed peer-to-peer systems, for most applications, use multiple computers to do identical tasks, and thus the system continues to be operational and available, even when a single computer fails or goes off-line. The most popular peer-to-peer networks are made up of thousands of computers. Although each computer in itself is a simple desktop and goes out of service frequently (when users switch off their machines), the entire system keeps on functioning without interruption.
6. Summary and Future Work
6.1. Introduction
In this report, we surveyed the field of P2P systems. We defined the field through terminology, architectures, goals, components, and challenges. We also introduced taxonomies for P2P systems, applications, and markets.
Based on this information, we summarized P2P system characteristics. Then, we surveyed different P2P system categories, as well as P2P markets. Out of the systems presented in Section 5, we selected eight case studies and described them in more detail. We also compared them based on the characteristics we introduced. Based on this information, we derived some lessons about P2P applications and systems. In the rest of this section, we revisit what P2P is, we explain why we think that P2P is an important technology, and finally we present the outlook for the P2P future.
6.2. Final Thoughts on What P2P Is
One of the most contentious aspects in writing this paper was to define what P2P is and what it is not. Even after completing this effort, we do not feel compelled to offer a concise definition and a recipe of what P2P is and what it is not. A simple answer is that P2P is many things to many people and it is not possible to come up with a simplified answer. P2P is a mind set, a model, an implementation choice, and a property of a system or an environment.
6.3. Why We Think P2P is Important
As P2P becomes more mature, its future infrastructures will improve. There will be increased interoperability, more connections to the (Internet) world, and more robust software and hardware. Nevertheless, some inherent problems will remain. P2P will remain an important approach for the following reasons.
6.4. P2P in the Future
The authors of this paper believe that there are at least three ways in which P2P may have impact in the future:
6.5. Summary
We believe that P2P is an important technology that has already found its way into existing products and research projects. It will remain an important solution to certain inherent problems in distributed systems. P2P is not a solution to every problem in the future of computing. Alternatives to P2P are traditional technologies, such as centralized systems and the client-server model. Systems and applications do not necessarily have to be monolithic; they can participate in different degrees in the centralized/client-server/P2P paradigms. P2P will continue to be a strong alternative for scalability, anonymity, and fault resilience requirements. P2P algorithms, applications, and platforms have an opportunity for deployment in the future. From the market perspective, cost of ownership may be the driving factor for P2P. The strong presence of P2P products indicates that P2P is not only an interesting research technology but also a promising product base.
8. Conclusion
The peer-to-peer concept has been around for some time. It recently became more popular with the advent of peer-to-peer applications for content and information sharing (e.g. Napster, Gnutella) and for distributed processing and resource sharing (e.g. SETI@home). These applications exposed how many resources idle and underused computers currently waste. This has attracted considerable attention, and the potential of such systems is thought to be immense. However, a distinction has to be made between resource utilization through such systems and the actual technology and concepts that enable distributed computing in this context.
Peer-to-peer is a very heterogeneous area of work and research. Peer-to-peer concepts can be found at the application layer, in distributed systems, and within the communication subsystem. There are no major research or standardization initiatives that look at all aspects related to peer-to-peer technology and computing. The major application areas for peer-to-peer methods and technologies are Internet- and Web-based applications, grid computing and distributed systems, data sharing, and collaboration. Peer-to-peer concepts are used because they increase scalability (when the known restrictions are taken into account), can improve performance, and are inherently fault tolerant. Peer-to-peer systems are flexible and dynamic.
The term peer-to-peer is defined by its usage in different contexts, and no formal definition exists. The application areas of peer-to-peer concepts are also too heterogeneous to clearly define a fixed set of attributes that peer-to-peer systems have to adhere to. However, there are a number of characteristics that many peer-to-peer systems share in order to realize the advantages of peer-to-peer computing. Hence, peer-to-peer systems are characterized by how many of these characteristics they implement rather than by a specific, well-defined (sub)set.
9. Bibliography