1 of 9

Sdata

Infrastructure & data for university research

Jan ŠEDIVÝ

Brno, May 16th, http://nlp.fi.muni.cz/

2 of 9

Why Sdata

  • Seznam
    • Improve, innovate, differentiate
    • Test and research new ideas
    • Employ the best graduates

  • Universities need to work with industry
    • Students want to solve real problems
    • Universities need industry feedback
    • Access to data

3 of 9

Research requires data

Research requires data in many different forms

  • Big Data, up-to-date, interactive, streamed
  • Big Data is difficult to move and to share
  • Big Data needs to come with adequate infrastructure
  • Supervised models need labeled data

4 of 9

Evaluation & replicability required

  • Evaluation
    • Calibrators - take a long time, expensive, subjective.
    • A/B testing - difficult to run outside the company.

  • Replicability
    • Requires sharing data and algorithms.
    • Do the results generalize, or do they overfit?
    • Articles use proprietary data; reviewing criteria are a problem.

5 of 9

Data is confidential

Data is the most valuable asset for all companies.

How to share data with academics and researchers?

    • Confidentiality, security rules, NDA conditions.
    • Data preparation, packaging, anonymization, formats, labeling, ...
    • No standard APIs or formats for data sharing.
    • Big Data - difficult to share and to move.
    • Computational infrastructure.

6 of 9

Research requires data

Problem:

    • How to make the data accessible to academic and research teams?

Solution:

    • Sdata runs in a grid/cloud center.
    • Seznam provides data.
    • Seznam controls access.

7 of 9

Sdata uses CERIT

  • European Grid Infrastructure (EGI).
    • Links infrastructure of hundreds of independent research institutes, universities and organisations
    • As of May 2013: more than 333,000 CPU cores, 1.4M jobs/day.
  • The National Grid Infrastructure (NGI) is part of EGI
    • NGI is operated by MetaCentrum and Cesnet

8 of 9

Support Czech Universities

  • Sdata is operated by Seznam, CERIT and ČVUT.
  • It supports Czech universities in the fields of information retrieval, machine learning and related disciplines.
  • Access will be granted under NDA to researchers and students.
  • The data sets will be continuously updated based on research needs and availability.

9 of 9

Thanks