COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF BONN


Lab Distributed Big Data Analytics

Worksheet-4: Spark ML and SANSA


Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann

Nov 14, 2017

In this lab we are going to perform basic Spark ML and SANSA operations (described in “Spark Fundamentals II (MLlib - GraphX)” and “SANSA”).

You will use MLlib and SANSA to find the subject distribution over an .nt file. The purpose is to demonstrate how to use Spark MLlib and SANSA on top of Spark.


IN CLASS


  1. Spark ML
     a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and uploaded to HDFS under your /yourname folder, create an RDD out of this file.
     b. First, create a Scala class Triple that holds the information of a triple read from the file and serves as the schema. Since the data is an .nt file, whose rows are triples in the format <subject> <predicate> <object>, the raw lines need to be transformed into this representation. Hint: use the map function.
     c. Create an RDD of Triple objects.
     d. Use the filter transformation to return a new RDD containing only the triples of the file, by removing every line that starts with “#”, which in an .nt file marks a comment.
     e. Implement the TF-IDF algorithm to find the most used classes in the given dataset. TF-IDF in Spark uses:
        - TF: HashingTF is a Transformer which takes sets of terms and converts them into fixed-length feature vectors.
        - IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel, which takes feature vectors and rescales each column.
        Proceed as follows:
        i. Extract the classes into a separate RDD.
        ii. Split each label into words using Tokenizer. For each sentence, use HashingTF to hash it into a feature vector, then use IDF to rescale the feature vectors.
        iii. Pass the feature vectors into a learning algorithm.
     f. Collect and print the results. (A sketch of these steps is given below.)
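
A minimal sketch of steps a-f, assuming Spark 2.x with the spark.ml API. The HDFS path, the naive whitespace-based line parsing, and the choice to run TF-IDF over the subject strings are illustrative assumptions, not the only valid reading of the exercise.

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    // Schema class for one triple read from the .nt file (step b).
    // The object position is named "obj" because "object" is a Scala keyword.
    case class Triple(subject: String, predicate: String, obj: String)

    object SubjectTfIdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("Worksheet-4 Spark ML").getOrCreate()
        import spark.implicits._

        // a) Create an RDD out of the uploaded file (path is an assumption).
        val lines = spark.sparkContext.textFile("hdfs:///yourname/page_links_simple.nt")

        // d) Drop comment lines (lines starting with "#"), then
        // b/c) map each remaining line into a Triple object.
        val triples = lines
          .filter(line => !line.startsWith("#"))
          .map { line =>
            // Naive whitespace split; a real N-Triples parser is more robust.
            val parts = line.split("\\s+", 3)
            Triple(parts(0), parts(1), parts(2).stripSuffix(" ."))
          }

        // e) Tokenizer -> HashingTF -> IDF over the subject strings.
        val sentences = triples.map(_.subject).toDF("sentence")
        val words = new Tokenizer()
          .setInputCol("sentence").setOutputCol("words")
          .transform(sentences)
        val featurized = new HashingTF()
          .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
          .transform(words)
        val idfModel = new IDF()
          .setInputCol("rawFeatures").setOutputCol("features")
          .fit(featurized)
        val rescaled = idfModel.transform(featurized)
        // e.iii) The "features" column could now be fed to any spark.ml estimator.

        // f) Collect and print (a sample of) the results.
        rescaled.select("sentence", "features").show(20, truncate = false)

        spark.stop()
      }
    }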
  2. SANSA
     a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and uploaded to HDFS under your /yourname folder, create an RDD out of this file.
     b. Read an RDF file using SANSA and retrieve a Spark RDD representation of it.
     c. Read an OWL file using SANSA and retrieve a Spark RDD/Dataset representation of it.
     d. Use the SANSA-Inference layer to apply inference/reasoning over the RDF file with the RDFS profile reasoner. (A sketch of steps b and d follows below.)
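
A minimal sketch of steps b and d, following the SANSA-Examples code for the SANSA 0.x releases current at the time of writing; the package names, class names, and signatures (NTripleReader.load, RDFGraphLoader.loadFromDisk, ForwardRuleReasonerRDFS) may differ in other SANSA versions, and the input path is again an assumption. Reading an OWL file (step c) works analogously through the sansa-owl layer.

    import java.net.URI

    import org.apache.spark.sql.SparkSession
    import net.sansa_stack.rdf.spark.io.NTripleReader
    import net.sansa_stack.inference.spark.data.loader.RDFGraphLoader
    import net.sansa_stack.inference.spark.forwardchaining.ForwardRuleReasonerRDFS

    object SansaReadAndInfer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("Worksheet-4 SANSA").getOrCreate()
        val input = "hdfs:///yourname/page_links_simple.nt" // assumed path

        // b) Read the RDF file with SANSA: yields an RDD of Jena Triple objects.
        val triples = NTripleReader.load(spark, URI.create(input))
        println(s"Loaded ${triples.count()} triples")

        // d) Apply the RDFS profile reasoner from the SANSA-Inference layer.
        val graph = RDFGraphLoader.loadFromDisk(spark, URI.create(input), 4)
        val reasoner = new ForwardRuleReasonerRDFS(spark.sparkContext)
        val inferred = reasoner.apply(graph)
        println(s"Inferred graph contains ${inferred.size()} triples")

        spark.stop()
      }
    }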

AT HOME


  1. SANSA-Notebook
     a. Run SANSA-Examples using SANSA-Notebooks and perform the following example (a sketch of the underlying operations follows this list):
        i. Read a file into an RDD representation.
        ii. Compute the property distribution and the class distribution using SANSA functions.
        iii. Apply the RDFS reasoner from SANSA-Inference to the same graph.
        iv. Compute the class distribution of the inferred graph from step iii.
        v. Rank the resources of the inferred graph and show the 20 highest-ranked entities.
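
SANSA-Notebooks expose the distribution and ranking operations as ready-made blocks; the sketch below hand-rolls equivalents with plain Spark and GraphX over the Jena triples loaded as in the SANSA sketch above, purely to illustrate what those blocks compute. Using GraphX PageRank for step v is an assumption about the ranking method; SANSA's own ranking may be implemented differently.

    import org.apache.jena.graph.Triple
    import org.apache.jena.vocabulary.RDF
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD

    object DistributionAndRanking {

      // ii) Class distribution: how often each class occurs as the object
      //     of an rdf:type triple, most frequent first.
      def classDistribution(triples: RDD[Triple]): Array[(String, Long)] =
        triples
          .filter(t => t.getPredicate.getURI == RDF.`type`.getURI)
          .map(t => (t.getObject.toString, 1L))
          .reduceByKey(_ + _)
          .sortBy(_._2, ascending = false)
          .collect()

      // ii) Property distribution: how often each predicate is used.
      def propertyDistribution(triples: RDD[Triple]): Array[(String, Long)] =
        triples
          .map(t => (t.getPredicate.toString, 1L))
          .reduceByKey(_ + _)
          .sortBy(_._2, ascending = false)
          .collect()

      // v) Rank resources with PageRank over the triple graph and
      //    return the 20 highest-ranked entities.
      def top20ByPageRank(triples: RDD[Triple]): Array[(String, Double)] = {
        // Assign a numeric vertex id to every subject and object node.
        val nodes = triples
          .flatMap(t => Seq(t.getSubject.toString, t.getObject.toString))
          .distinct()
          .zipWithUniqueId()                        // (label, id)

        // Translate each triple into an edge between vertex ids.
        val edges = triples
          .map(t => (t.getSubject.toString, t.getObject.toString))
          .join(nodes)                              // (subj, (obj, subjId))
          .map { case (_, (obj, subjId)) => (obj, subjId) }
          .join(nodes)                              // (obj, (subjId, objId))
          .map { case (_, (subjId, objId)) => Edge(subjId, objId, 1) }

        val graph = Graph(nodes.map(_.swap), edges) // vertices keyed by id
        val ranks = graph.pageRank(0.0001).vertices // (id, rank)

        ranks.join(graph.vertices)                  // (id, (rank, label))
          .map { case (_, (rank, label)) => (label, rank) }
          .sortBy(_._2, ascending = false)
          .take(20)
      }
    }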
  2. Read and explore
     a. Spark Machine Learning Library (MLlib) Guide
     b. SANSA Overview and SANSA FAQ
  3. Further readings
     a. MLlib: Machine Learning in Apache Spark
     b. Distributed Semantic Analytics using the SANSA Stack by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference – Resources Track (ISWC 2017), 2017.
     c. The Tale of Sansa Spark by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference, Posters & Demos Track, 2017.