Βιβλιοθήκη ΟΠΑ - Ψηφιακό Αποθετήριο

ΠΥΞΙΔΑ Ιδρυματικό Αποθετήριο
και Ψηφιακή Βιβλιοθήκη

Όνομα χρήστη
Κωδικός πρόσβασης

Συλλογές :	Ιδρυματικό Αποθετήριο ΟΠΑ / AUEB Institutional Repository Σχολή Επιστημών και Τεχνολογίας της Πληροφορίας / School of Informatics Τμήμα Πληροφορικής / Department of Informatics Μεταπτυχιακές Εργασίες / Postgraduate dissertations

Τίτλος :	Βελτίωση απόκρισης κατανεμημένων συστημάτων ανάλυσης γράφων με χρήση αλγορίθμων κατάτμησης πραγματικού χρόνου

Δημιουργός :	Βερδελής, Δημήτριος

Συντελεστής :	Κωτίδης, Ιωάννης (Επιβλέπων καθηγητής) Οικονομικό Πανεπιστήμιο Αθηνών, Τμήμα Πληροφορικής (Degree granting institution)

Τύπος :	Text

Φυσική περιγραφή :	63 p.

Γλώσσα :	en

Αναγνωριστικό :	http://www.pyxida.aueb.gr/index.php?op=view_object&object_id=6646

Περίληψη :	Η κατανομή γράφων (graph partitioning) στη γενική της μορφή είναι η διανομή των κόμβων (vertices) ενός γράφου σε έναν προκαθορισμένο αριθμό υπογράφων k με περίπου τον ίδιο αριθμό κορυφών σε κάθε υπογράφο. Αυτό ονομάζεται ισορροπημένος διαχωρισμός γράφων. Ο στόχος είναι να ελαχιστοποιηθεί ο αριθμός των ακμών που διατρέχουν δύο διαφορετικούς υπογράφους. Το πρόβλημα αυτό είναι NP-hard που σημαίνει ότι επιλύεται σε μη πολυωνυμικούς χρόνους, ακόμη και όταν η πλήρης πληροφορία είναι διαθέσιμη και χωρίς τους περιορισμούς των κατανεμημένων συστημάτων. Δεδομένων των περιορισμών των μεγάλων δεδομένων, είναι αδύνατο να χρησιμοποιηθούν αλγόριθμοι που θα απαιτούσαν πολλαπλές αναγνώσεις των δεδομένων για την επίλυση αυτού του προβλήματος.Τα προγράμματα που χρησιμοποιούνται ευρέως, όπως το framework επεξεργασίας γραφημάτων GraphX του Spark, δεν έχουν κάποια καλή λύση για αυτό το ζήτημα. Οι αλγόριθμοι κατανομής γραφημάτων που περιλαμβάνονται στις βιβλιοθήκες τους χρησιμοποιούν είτε τυχαίες προσεγγίσεις είτε χωρίζουν τους γράφους με βάση τα αναγνωριστικά (Ids) των κόμβων τους. Ωστόσο, δεν μπορεί να διασφαλιστεί ότι το αναγνωριστικό ενός κόμβου έχει οποιαδήποτε σημασιολογία και στην πραγματικότητα μπορεί να είναι εντελώς αυθαίρετο. Επίσης, ακόμη και αν η προσέγγιση κατατμήσεως ήταν επιτυχής και μείωνε τους χρόνους εκτέλεσης διάφορων υπολογισμών, όταν τα δεδομένα προστίθενται στο γράφο ή αφαιρούνται από αυτόν, ο προκύπτον γράφος δεν ξαναχωρίζεται σωστά σε partitions, κάτι που μπορεί να έχει εκ νέου μεγάλο υπολογιστικό κόστος.Για το λόγο αυτό, οι αλγόριθμοι διαχωρισμού γράφων συνεχούς ροής (streaming graph algorithms) είναι καλές λύσεις για αυτό το πρόβλημα. Αυτοί οι αλγόριθμοι προσπαθούν να κατατμήσουν τους κόμβους καθώς διαβάζουν τα αρχικά δεδομένα γραφήματος (ροή), τοποθετώντας τον κάθε κόμβο στο βέλτιστο διαμέρισμα-υπογράφο με ευριστικό τρόπο. Επίσης, δεδομένου ότι είναι αλγόριθμοι ροής μπορούν να τοποθετήσουν μελλοντικές προσθήκες στο γράφο στο βέλτιστο διαμέρισμα, φροντίζοντας να είναι πάντα χωρισμένος με βέλτιστο τρόπο. Για το λόγο αυτό ονομάζονται συχνά δυναμικοί αλγόριθμοι, καθώς συμπεριφέρονται καλά στις αλλαγές δεδομένων.Σε αυτή την εργασία δυο αλγόριθμοι διαχωρισμού γράφων, ο LDG και ο FENNEL ελέγχονται σχετικά με την ικανότητά τους να μειώνουν το χρόνο εκτέλεσης διάφορων επερωτημάτων σε διάφορα σύνολα δεδομένων που προέρχονται από πραγματικές βάσεις δεδομένων. The scale of data means that it is no longer possible to analyze data in a single machine and distributed computing has become almost synonymous to big data. For flat data and tables, partitioning and assigning the data to the different nodes in a distributed cluster is often done in a trivial manner since the MapReduce calculations do not usually take into account data in other partitions. However, when one tries to distribute graph data in a cluster, the partitioning of the graph becomes an important decision to be made.Graph partitioning in its general form consists of distributing the vertices of a graph in a preset number of k partitions with approximately the same number of vertices in each partition, which is termed balanced graph partitioning. The goal is to minimize number of cross partition edges (edge-cut). This problem is NP-hard meaning it is solved in non-polynomial times, even when given full information without the distributed computing constraints. Given the big data limitations, it is impossible to employ algorithms that would require multiple reads of the data to solve this.Industrial solutions, such as Spark’s GraphX graph processing framework do not have any good solution for this issue. The graph partitioning algorithms included in their libraries employ approaches that partition the graph edges and then duplicate the vertices to all the partitions. However, it cannot be guaranteed that the ID of a node has any semantics and in fact it may be entirely arbitrary. Also, even if the partitioning approach was successful and optimized execution times, when data is added to the graph or removed from it, the resulting graph is not repartitioned on the fly and the graph needs to be repartitioned, which is costly.For this reason, streaming graph partitioning algorithms are good candidates for this problem. These algorithms try to partition the vertices as they read the initial graph data (stream), placing each one heuristically to the best partition. Also, since they are streaming algorithms they can also place future additions to the graph in the optimal partition. For this reason they are often termed dynamic algorithm since they behave well to data changes.In this project two streaming graph partitioners LDG and FENNEL are being tested for their ability to decrease graph analytics processing time in various real-world datasets.LDG is an algorithm that tries to maximize a metric when placing each vertex to a partition. This can be maximizing the total number of neighbors or the total number of triangles

Περίληψη :

Η κατανομή γράφων (graph partitioning) στη γενική της μορφή είναι η διανομή των κόμβων (vertices) ενός γράφου σε έναν προκαθορισμένο αριθμό υπογράφων k με περίπου τον ίδιο αριθμό κορυφών σε κάθε υπογράφο. Αυτό ονομάζεται ισορροπημένος διαχωρισμός γράφων. Ο στόχος είναι να ελαχιστοποιηθεί ο αριθμός των ακμών που διατρέχουν δύο διαφορετικούς υπογράφους. Το πρόβλημα αυτό είναι NP-hard που σημαίνει ότι επιλύεται σε μη πολυωνυμικούς χρόνους, ακόμη και όταν η πλήρης πληροφορία είναι διαθέσιμη και χωρίς τους περιορισμούς των κατανεμημένων συστημάτων. Δεδομένων των περιορισμών των μεγάλων δεδομένων, είναι αδύνατο να χρησιμοποιηθούν αλγόριθμοι που θα απαιτούσαν πολλαπλές αναγνώσεις των δεδομένων για την επίλυση αυτού του προβλήματος.Τα προγράμματα που χρησιμοποιούνται ευρέως, όπως το framework επεξεργασίας γραφημάτων GraphX του Spark, δεν έχουν κάποια καλή λύση για αυτό το ζήτημα. Οι αλγόριθμοι κατανομής γραφημάτων που περιλαμβάνονται στις βιβλιοθήκες τους χρησιμοποιούν είτε τυχαίες προσεγγίσεις είτε χωρίζουν τους γράφους με βάση τα αναγνωριστικά (Ids) των κόμβων τους. Ωστόσο, δεν μπορεί να διασφαλιστεί ότι το αναγνωριστικό ενός κόμβου έχει οποιαδήποτε σημασιολογία και στην πραγματικότητα μπορεί να είναι εντελώς αυθαίρετο. Επίσης, ακόμη και αν η προσέγγιση κατατμήσεως ήταν επιτυχής και μείωνε τους χρόνους εκτέλεσης διάφορων υπολογισμών, όταν τα δεδομένα προστίθενται στο γράφο ή αφαιρούνται από αυτόν, ο προκύπτον γράφος δεν ξαναχωρίζεται σωστά σε partitions, κάτι που μπορεί να έχει εκ νέου μεγάλο υπολογιστικό κόστος.Για το λόγο αυτό, οι αλγόριθμοι διαχωρισμού γράφων συνεχούς ροής (streaming graph algorithms) είναι καλές λύσεις για αυτό το πρόβλημα. Αυτοί οι αλγόριθμοι προσπαθούν να κατατμήσουν τους κόμβους καθώς διαβάζουν τα αρχικά δεδομένα γραφήματος (ροή), τοποθετώντας τον κάθε κόμβο στο βέλτιστο διαμέρισμα-υπογράφο με ευριστικό τρόπο. Επίσης, δεδομένου ότι είναι αλγόριθμοι ροής μπορούν να τοποθετήσουν μελλοντικές προσθήκες στο γράφο στο βέλτιστο διαμέρισμα, φροντίζοντας να είναι πάντα χωρισμένος με βέλτιστο τρόπο. Για το λόγο αυτό ονομάζονται συχνά δυναμικοί αλγόριθμοι, καθώς συμπεριφέρονται καλά στις αλλαγές δεδομένων.Σε αυτή την εργασία δυο αλγόριθμοι διαχωρισμού γράφων, ο LDG και ο FENNEL ελέγχονται σχετικά με την ικανότητά τους να μειώνουν το χρόνο εκτέλεσης διάφορων επερωτημάτων σε διάφορα σύνολα δεδομένων που προέρχονται από πραγματικές βάσεις δεδομένων.
The scale of data means that it is no longer possible to analyze data in a single machine and distributed computing has become almost synonymous to big data. For flat data and tables, partitioning and assigning the data to the different nodes in a distributed cluster is often done in a trivial manner since the MapReduce calculations do not usually take into account data in other partitions. However, when one tries to distribute graph data in a cluster, the partitioning of the graph becomes an important decision to be made.Graph partitioning in its general form consists of distributing the vertices of a graph in a preset number of k partitions with approximately the same number of vertices in each partition, which is termed balanced graph partitioning. The goal is to minimize number of cross partition edges (edge-cut). This problem is NP-hard meaning it is solved in non-polynomial times, even when given full information without the distributed computing constraints. Given the big data limitations, it is impossible to employ algorithms that would require multiple reads of the data to solve this.Industrial solutions, such as Spark’s GraphX graph processing framework do not have any good solution for this issue. The graph partitioning algorithms included in their libraries employ approaches that partition the graph edges and then duplicate the vertices to all the partitions. However, it cannot be guaranteed that the ID of a node has any semantics and in fact it may be entirely arbitrary. Also, even if the partitioning approach was successful and optimized execution times, when data is added to the graph or removed from it, the resulting graph is not repartitioned on the fly and the graph needs to be repartitioned, which is costly.For this reason, streaming graph partitioning algorithms are good candidates for this problem. These algorithms try to partition the vertices as they read the initial graph data (stream), placing each one heuristically to the best partition. Also, since they are streaming algorithms they can also place future additions to the graph in the optimal partition. For this reason they are often termed dynamic algorithm since they behave well to data changes.In this project two streaming graph partitioners LDG and FENNEL are being tested for their ability to decrease graph analytics processing time in various real-world datasets.LDG is an algorithm that tries to maximize a metric when placing each vertex to a partition. This can be maximizing the total number of neighbors or the total number of triangles

Λέξη κλειδί :	Κατάτμηση γράφων FENNEL LDG Επιδόσεις συστημάτων big data Graph partitioning FENNEL LDG Graph benchmarks

Διαθέσιμο από :	2019-01-03 11:10:55

Ημερομηνία έκδοσης :	2018

Ημερομηνία κατάθεσης :	2019-01-03 11:10:55

Δικαιώματα χρήσης :	Free access

Άδεια χρήσης :

Αρχείο: Verdelis_2018.pdf

Τύπος: application/pdf

Είσοδος