1 of 25

Pyladies Industry Sharing - Data Scientist

2 of 25

潘玫樺 (Mei-Hua Pan)

B.S., College of Electrical Engineering and Computer Science, National Tsing Hua University

M.S., Institute of Computer Science, National Tsing Hua University

Currently a Data Scientist at Pinkoi

3 of 25

Let's use good design to realize an aesthetic life together

4 of 25

Search system

On-site product popularity ranking (ranking system)

Recommendation system

5 of 25

6 of 25

7 of 25

credit: xkcd.com/1838/

Garbage In,

Garbage Out

8 of 25

9 of 25

10 of 25

11 of 25

numpy

scipy

pandas

sqlalchemy

scikit-learn

click

Apache Airflow

Apache Superset

pyspark

pyhive

Data Preparation

Model Development

other ml modules..

Model Deployment

Monitor

poetry

dramatiq

12 of 25

numpy

scipy

pandas

sqlalchemy

pyspark

pyhive

Data Preparation

pyspark

  • Spark Python API
  • Cluster computing system
  • Need to understand Spark to program with it effectively (a minimal sketch follows this list)
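
A minimal PySpark sketch of this style of programming; the Hive table name and columns below are hypothetical:

# Minimal sketch; the table name and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("item-popularity").getOrCreate()

# Read a click-event table and rank items by click count on the cluster.
clicks = spark.table("events.item_clicks")
popularity = (
    clicks.groupBy("item_id")
          .agg(F.count("*").alias("n_clicks"))
          .orderBy(F.col("n_clicks").desc())
)
popularity.show(10)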

pyhive + sqlalchemy (ORM)

  • Hive is a data warehouse that uses HiveQL (a SQL-like query language); queries are executed via MapReduce.
  • Leverage the SQLAlchemy ORM (Object Relational Mapping) to generate complicated SQL for us (a connection sketch follows this list).
  • Hive built-in functions provide lots of convenience!
  • Not recommended! The PyHive library lacks many functionalities.
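
A minimal sketch of querying Hive through PyHive's SQLAlchemy dialect; the host, port, and table name are hypothetical:

# Minimal sketch; host, port, and table name are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# PyHive registers the "hive://" dialect with SQLAlchemy.
engine = create_engine("hive://hive-server.example.com:10000/default")

# Pull query results straight into a pandas DataFrame.
df = pd.read_sql(
    "SELECT item_id, COUNT(*) AS n_clicks FROM item_clicks GROUP BY item_id",
    engine,
)
print(df.head())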

13 of 25

numpy

scipy

pandas

scikit-learn

Model Development

other ml modules..

pipenv

pipenv + Jupyter Notebook

  • pipenv to set up the working environment
  • Add multiple IPython kernels for different virtualenvs to Jupyter for better project management

# install the project itself
pipenv install -e .

# add a kernel for the current environment
pipenv shell
python -m ipykernel install --user --name=my-virtualenv-name

# check whether it was installed successfully
jupyter kernelspec list

14 of 25

15 of 25

numpy

scipy

pandas

scikit-learn

Model Development

other ml modules..

pipenv

numpy + scipy

  • Scientific computing for Python
  • High performance!
  • scipy.stats has lots of useful statistical functions (a small sketch follows this list)
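
A small sketch of scipy.stats in use, on made-up samples (e.g. metrics from two experiment variants):

# Small sketch; the two samples are made-up data.
import numpy as np
from scipy import stats

a = np.random.normal(loc=0.10, scale=0.02, size=1000)
b = np.random.normal(loc=0.11, scale=0.02, size=1000)

# Two-sample t-test from scipy.stats.
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)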

pandas

  • Based on numpy and scipy.
  • Flexible and powerful data analysis / manipulation library for Python.
  • Provides labeled data structures similar to R data.frame objects, statistical functions, and much more.
  • Suggestion: finish reading the official tutorial and cookbook (a small sketch follows this list).
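
A small pandas sketch on made-up order data, showing the labeled, R data.frame-like workflow:

# Small sketch; the order data is made up.
import pandas as pd

orders = pd.DataFrame({
    "item_id": [1, 1, 2, 3, 3, 3],
    "price":   [100, 120, 80, 60, 60, 75],
})

# Group, aggregate, and sort with labeled columns.
summary = (
    orders.groupby("item_id")["price"]
          .agg(["count", "mean"])
          .sort_values("count", ascending=False)
)
print(summary)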

scikit-learn

  • Machine learning in Python! (a small sketch follows)
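
A small scikit-learn sketch on a built-in toy dataset; the model choice here is only for illustration:

# Small sketch; the dataset and model are only for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))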

16 of 25

click

Apache Airflow

Model Deployment

pipenv

Airflow

  • Use Python to program workflows as DAGs (directed acyclic graphs).
  • Web user interface to easily visualize pipeline status and monitor progress.
  • Built-in operators and sensors for different use cases (a minimal DAG sketch follows this list).
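
A minimal DAG sketch, assuming Airflow 1.x-style imports; the DAG id, schedule, and commands are hypothetical:

# Minimal sketch assuming Airflow 1.x-style imports; names are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="daily_popularity_ranking",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

prepare = BashOperator(task_id="prepare_data", bash_command="echo prepare", dag=dag)
train = BashOperator(task_id="train_model", bash_command="echo train", dag=dag)

# prepare_data must finish before train_model starts.
prepare >> train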

17 of 25

18 of 25

(Airflow web UI screenshot: Task Duration view)

19 of 25

Click

  • Python package for creating command-line tools.
  • Automatically generates help pages (a small sketch follows).
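
A small click sketch; the command name and option are hypothetical. Running the script with --help prints the auto-generated help page:

# Small sketch; the command name and option are hypothetical.
import click

@click.command()
@click.option("--date", required=True, help="Partition date to process, e.g. 2019-01-01.")
def prepare_data(date):
    """Prepare training data for the given date."""
    click.echo("preparing data for {}".format(date))

if __name__ == "__main__":
    prepare_data()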

Usage:

  • Different DAGs to specify different kinds of workflows.
  • ExternalTaskSensor is a useful sensor for creating dependencies between different DAGs (see the sketch below).
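
A sketch of a cross-DAG dependency with ExternalTaskSensor, again assuming Airflow 1.x-style imports; the DAG and task ids are hypothetical:

# Sketch assuming Airflow 1.x-style imports; DAG/task ids are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG(
    dag_id="train_model_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Wait until task "prepare_data" in another DAG finishes for the same schedule.
wait_for_data = ExternalTaskSensor(
    task_id="wait_for_prepare_data",
    external_dag_id="data_preparation_dag",
    external_task_id="prepare_data",
    dag=dag,
)

train = BashOperator(task_id="train_model", bash_command="echo train", dag=dag)
wait_for_data >> train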

click

Apache Airflow

Model Deployment

pipenv

Airflow + click + pipenv

  • pipenv to separate different project environments
  • click to write the command-line interface
  • Airflow BashOperator to execute the commands (a sketch follows this list)
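
A sketch of the glue between the three: a BashOperator that runs the click CLI inside the project's pipenv environment; the project path and command are hypothetical:

# Sketch; the project path and CLI command are hypothetical.
from airflow.operators.bash_operator import BashOperator

prepare = BashOperator(
    task_id="prepare_data",
    # Run the click-based CLI inside the project's own pipenv environment;
    # {{ ds }} is Airflow's templated execution date.
    bash_command=(
        "cd /opt/projects/ranking && "
        "pipenv run python cli.py prepare-data --date {{ ds }}"
    ),
    dag=dag,  # DAG object defined as in the earlier sketch
)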

20 of 25

21 of 25

Apache Airflow

Apache Superset

Monitor

Superset:

  • Provides a rich set of data visualizations.
  • Integrates with different kinds of databases, including MySQL, PostgreSQL, and also Hive (example connection URIs follow this list).
  • Dashboards to monitor metrics.
  • SQL Lab to write SQL directly, either for testing or for quick database look-ups.
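
Superset registers each database through a SQLAlchemy URI; two example URIs, where the hosts, credentials, and database names are hypothetical:

postgresql://superset_user:password@pg-host.example.com:5432/analytics
hive://hive-server.example.com:10000/default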

22 of 25

Simple Demo

23 of 25

24 of 25

Q & A

25 of 25

Thank you, everyone!

mei-hua.pan@pinkoi.com