Topics of the Seminar "Data Wrangling"
Artur Andrzejak, Lutz Büch
Summer semester 2014, Institut für Informatik, Universität Heidelberg
Seminar page: http://pvs.ifi.uni-heidelberg.de/teaching/ss2014/s-data-wrangling/
Participants
First name | Last name | Topic | Block | Date | Preliminary meeting | Write-up |
Jonas | Cordes | B1 | 1 | 18 May | | |
Daniel | Egenolf | A1 | 1 | 18 May | | |
Tobias | Limpert | A5 | 1 | 18 May | | |
Christian | Kromm | C6 | 1 | 18 May | | |
Claudia | Dünkel | A3 | 2 | 1 June | | |
Jonas | Scholten | C9 | 2 | 1 June | | |
Matthias | Iacsa | B2 | 1 | 1 June | | |
Hüseyin | Dagaydin | C3 | 2 | 1 June | | |
Stefan | Mücke | C4 | 2 | 1 June | | |
Özhan | Durgan | C1 | 2 | 30 June | | |
Topic overview
A. Introduction to Data Wrangling
A1 Data Quality
A2 String similarity
A3 Schema Matching and Schema Mapping
A4 Record Matching
A5 Data Fusion
B. Modern data-driven methods
B1 Active Learning
B2 Programming by Example and Program Synthesis
C. Application of data-driven methods in Data Wrangling
C1 Wrapper induction for Data Extraction
C2 Learning String Transformation From Examples
C3 Automating String Processing in Spreadsheets Using I/O-Examples
C4 Synthesizing Number Transformation from I/O-Examples
C5 Learning Semantic String Transformations from Examples*
C6 Interactive Deduplication using Active Learning
C7 Transformation-based Framework for Record Matching
C8 Adaptive Duplicate Detection Using Learnable String Similarity Measures
C9 DUMAS - Horizontal Schema Matching using Duplicates
C10 iMAP: discovering complex semantic matches between database schemas
C11 Sample-Driven Schema Mapping
A. Introduction to Data Wrangling
General literature:
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Vol. 1. New York: Springer, 2007.
- U. Leser and F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, Heidelberg: dpunkt.verlag, 2007.
- P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection, Berlin; Heidelberg [et al.]: Springer, 2012.
A1 Data Quality
- Rahm, Erhard, and Hong Hai Do. "Data cleaning: Problems and current approaches." IEEE Data Eng. Bull. 23.4 (2000): 3-13. (link)
- U. Leser and F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, Heidelberg: dpunkt.verlag, 2007.
- Chapter 8.1 Data Cleaning (Datenreinigung)
- Chapter 8.4 Information Quality (Informationsqualität)
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Vol. 1. New York: Springer, 2007.
- Chapter 2 What is Data Quality and Why Should We Care?
- Chapter 3.2 Data Quality Problems and their Consequences
A2 String similarity
- William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWEB, pages 73–78, 2003. (link)
- Cohen, William, Pradeep Ravikumar, and Stephen Fienberg. A comparison of string metrics for matching names and records. KDD Workshop on Data Cleaning and Object Consolidation. Vol. 3. 2003. (link)
- Navarro, Gonzalo. "A guided tour to approximate string matching." ACM computing surveys (CSUR) 33.1 (2001): 31-88. (link)
- Section 2 Main Application Areas
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Vol. 1. New York: Springer, 2007.
- Chapter 11 Phonetic Coding Systems for Names
- Chapter 13 String Comparator Metrics for Typographical Error
- P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection, Berlin; Heidelberg [et al.]: Springer, 2012.
- Chapter 3.2 Issues with Names and Other Personal Information
- Chapter 3.3 Types and Sources of Variations and Errors in Names
- Chapter 4.3 (Phonetic) Encoding Functions
- Chapter 5.1-5.11
A3 Schema Matching and Schema Mapping
- schema-based
- instance-based
- U. Leser and F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, Heidelberg: dpunkt.verlag, 2007.
- Chapters 3.3.3-3.3.6 Heterogeneity (Heterogenität)
- Chapters 5.1-5.3
- Arnold, P. (2011). The Basics of Complex Correspondences and Functions and their Implementation and Semi-automatic Detection in COMA++. Leipzig. (link)
- Chapter 2 An Introduction in Schema Mapping
- Bilke, Alexander. Duplicate based schema matching. Diss. Berlin Institute of Technology, 2006. (link)
- Chapter 2 The Schema Matching Problem
A4 Record Matching
- blocking, indexing
- comparison, similarity
- classification, clustering
- evaluation of indexing and matching
- transitive closure, constraints, costs
- P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection, Berlin; Heidelberg [et al.]: Springer, 2012.
- Part I Overview
- Chapter 4 Indexing (without 4.3)
- Chapter 6.1-6.9
- Chapter 7.1-7.2
- U. Leser and F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, Heidelberg: dpunkt.verlag, 2007.
- Chapter 8.2 Duplicate Detection (Duplikaterkennung)
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Vol. 1. New York: Springer, 2007.
- Chapter 3.4 Successful Application of Record Linkage
- Chapter 8 Record Linkage - Methodology
- Chapter 12 Blocking
A5 Data Fusion
- Bleiholder, Jens, and Felix Naumann. "Data fusion." ACM Computing Surveys (CSUR) 41.1 (2008): 1. (link)
- U. Leser and F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, Heidelberg: dpunkt.verlag, 2007.
- Bleiholder, Jens, and Felix Naumann. "Conflict handling strategies in an integrated information system." (2006). (link)
- [Tutorial] Dong, Xin Luna, and Felix Naumann. "Data fusion: resolving data conflicts for integration." Proceedings of the VLDB Endowment 2.2 (2009): 1654-1655. (overview, full)
B. Modern data-driven methods
B1 Active Learning
- Settles, Burr. "Active learning." Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1-114. (link)
B2 Programming by Example and Program Synthesis
- Lau, Tessa, et al. "Programming by demonstration using version space algebra." Machine Learning 53.1-2 (2003): 111-156. (link)
- Lau, Tessa A., Pedro Domingos, and Daniel S. Weld. "Version Space Algebra and its Application to Programming by Demonstration." ICML. 2000. (link)
- Gulwani, Sumit. "Dimensions in program synthesis." Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming. ACM, 2010. (link)
- Menon, Aditya, et al. "A machine learning framework for programming by example." Proceedings of The 30th International Conference on Machine Learning. 2013. (link)
- Lieberman, Henry. Your wish is my command: Programming by example. Morgan Kaufmann, 2001. (link)
- Gulwani, Sumit. "Synthesis from examples: Interaction models and algorithms." Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on. IEEE, 2012. (link)
C. Application of data-driven methods in Data Wrangling
C1 Wrapper induction for Data Extraction
- Knoblock, Craig A., et al. "Accurately and reliably extracting data from the web: A machine learning approach." Intelligent exploration of the web. Physica-Verlag HD, 2003. 275-287. (link)
C2 Learning String Transformation From Examples
- Arasu, Arvind, Surajit Chaudhuri, and Raghav Kaushik. "Learning string transformations from examples." Proceedings of the VLDB Endowment 2.1 (2009): 514-525. (link)
C3 Automating String Processing in Spreadsheets Using I/O-Examples
- Gulwani, Sumit. "Automating string processing in spreadsheets using input-output examples." ACM SIGPLAN Notices. Vol. 46. No. 1. ACM, 2011. (link)
C4 Synthesizing Number Transformation from I/O-Examples
- Singh, Rishabh, and Sumit Gulwani. "Synthesizing number transformations from input-output examples." Computer Aided Verification. Springer Berlin Heidelberg, 2012. (link)
C5 Learning Semantic String Transformations from Examples*
- Singh, Rishabh, and Sumit Gulwani. "Learning semantic string transformations from examples." Proceedings of the VLDB Endowment 5.8 (2012): 740-751. (link)
C6 Interactive Deduplication using Active Learning
- Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002. (link)
- Sarawagi, Sunita, et al. "Alias: An active learning led interactive deduplication system." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002. (link)
C7 Transformation-based Framework for Record Matching
- Arasu, Arvind, Surajit Chaudhuri, and Raghav Kaushik. "Transformation-based framework for record matching." Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. IEEE, 2008. (link)
- Arasu, Arvind, et al. "Incorporating string transformations in record matching." Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008. (link)
C8 Adaptive Duplicate Detection Using Learnable String Similarity Measures
- Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003. (link)
C9 DUMAS - Horizontal Schema Matching using Duplicates
- Bilke, Alexander, and Felix Naumann. "Schema matching using duplicates." Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005. (link)
- Bilke, Alexander. Duplicate based schema matching. Diss. Berlin Institute of Technology, 2006. (link)
C10 iMAP: discovering complex semantic matches between database schemas
- Dhamankar, Robin, et al. "iMAP: discovering complex semantic matches between database schemas." Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, 2004. (link)
- Arnold, P. (2011). The Basics of Complex Correspondences and Functions and their Implementation and Semi-automatic Detection in COMA++. Leipzig. (link)
C11 Sample-Driven Schema Mapping
- Qian, Li, Michael J. Cafarella, and H. V. Jagadish. "Sample-driven schema mapping." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012. (link)