|Accelerating Data Science & Scientific Computing in Python using Numba||“Use a screwdriver instead of a hammer. Try to untighten the nut with your hand. Utilize the path of least resistance first.” - Tim Allen|
The main reasons for Python's wide adoption are its ease of learning, usability and readability, coupled with a powerful ecosystem of packages. This makes Python an attractive language for researchers and scholars undertaking computational projects and theses. The ease of prototyping and tinkering also allows more iterations and customization within a project, which increases research output. One pain point, however, is Python's speed compared with languages such as C++ or Fortran, which are still widely used in research.
Scholars who hit the performance bottleneck of pure Python code often come across ways to speed it up, such as PyMPI, NumPy or Cython. But the learning curve is steep as things get less familiar. If learning Python is so easy, why should making Python code fast be so difficult?
This workshop will address this question and introduce Numba, an open source JIT compiler that translates Python and NumPy code into fast machine code. Attendees will work through 4 real-world computational exercises that demonstrate the core concepts, ease and effectiveness of using Numba, showing how it lowers the barrier to high-performance code for data science and scientific computing in Python.
The workshop will be divided into the following sections:
1) Why Numba? - 15 mins
2) Exercise #1: Mandelbrot Set - 15 mins
3) Exercise #2: Solving Poisson Equation (Heat Transfer) - 30 mins
4) Exercise #3: Unsupervised learning analytics pipeline - 30 mins
5) Exercise #4: Neural Network - 25 mins
6) Closing Remarks and Q&A - 5 mins
- Technical: Basic Python Programming
- Software: Anaconda Suite for Python 3 with pre-installed python packages: numba, numpy
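As a rough illustration of the kind of code the Mandelbrot exercise might start from (this is a sketch, not the workshop's actual material), the escape-time loop below is decorated with Numba's `njit`. The function names, grid bounds and the pure-Python fallback decorator are assumptions made so the snippet also runs where Numba is not installed.

```python
try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (slowly) without Numba installed
        return func


@njit
def escape_time(c_re, c_im, max_iter):
    """Number of iterations before z = z*z + c leaves the radius-2 disc."""
    z_re = 0.0
    z_im = 0.0
    for i in range(max_iter):
        z_re, z_im = z_re * z_re - z_im * z_im + c_re, 2.0 * z_re * z_im + c_im
        if z_re * z_re + z_im * z_im > 4.0:
            return i
    return max_iter


def mandelbrot_rows(width, height, max_iter=50):
    """Escape-time grid over [-2, 1] x [-1.5, 1.5] as nested lists."""
    rows = []
    for y in range(height):
        c_im = -1.5 + 3.0 * y / (height - 1)
        rows.append([escape_time(-2.0 + 3.0 * x / (width - 1), c_im, max_iter)
                     for x in range(width)])
    return rows
```

Only the inner numeric loop is compiled; the outer list-building code stays ordinary Python, which is one common way to introduce Numba into existing code.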
|Building an open source voice assistant with Python deepspeech library by Mozilla||Platforms like Amazon Alexa, Google Assistant and voice services from other tech giants are creating a buzz, and they let developers build custom applications on top of them. Unfortunately, these platforms are also serious violators of privacy, and the voice data they collect (sometimes without our consent) is used to feed their machine learning algorithms.|
But what if you wanted to build an assistant that runs locally with a decentralized edge server and ensures the privacy of your data? You can do it using the open source tools Rasa, Mozilla DeepSpeech and Mozilla TTS. This workshop will teach you to build an open source voice assistant that respects your privacy and keeps your data under your control. The workshop is beginner-friendly and follows an easy-to-build methodology.
Prerequisite software to be installed:
Python 3, pip3, npm and a GNU/Linux system (preferable, but not mandatory)
|Digital Signal Processing using Python||Digital signal processing algorithms such as linear convolution, correlation, the Discrete Fourier Transform, FIR and IIR filtering techniques, and power spectrum estimation are implemented in Python using libraries such as NumPy, SciPy, Matplotlib and Pylab. Hands-on training will be given on implementing these algorithms. The workshop also covers digital image processing using Python.|
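To make two of the listed algorithms concrete, here is a minimal standard-library sketch of direct-form linear convolution and a naive Discrete Fourier Transform. It is illustrative only; the workshop itself names NumPy and SciPy, whose built-in routines would normally replace these loops.

```python
import cmath


def linear_convolve(x, h):
    """Direct-form linear convolution: y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def dft(x):
    """Naive O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]
```

Convolving a signal with a short list of coefficients, as in `linear_convolve(signal, [0.5, 0.5])`, is also the simplest form of FIR filtering (here a two-tap moving average).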
|Simulation of wireless communication system using Python||Simulation of wireless communication systems|
Duration: 3 hours
Track: beginner or advanced : Beginner with an interest in understanding how mobile/wifi wireless communication works at a bit level
Intended audience : ECE /CS graduates
Why should someone attend your workshop?
-- To understand how the physical layer, the lowest layer of the OSI model, works
-- To understand the need for simulation and its usefulness
What will they get at the end of it?
-- Usage of NumPy, SciPy and Matplotlib, and how to build a wireless communication system using them
-- Working/Design principles of ADC, Filters, Transforms (IFFT, FFT), Modulation schemes like BPSK
Outline of workshop with a reasonable breakup in terms of time.
-- Intro to simulation using Python - 20 minutes
-- Intro to wireless receivers and the blocks present in them - 20 minutes
-- Simulation of ADC, Filters, FFT, BPSK in Python - 100 minutes
-- Simulation of an OFDM wireless communication system in Python - 30 minutes
-- Q & A and buffer time - 10 minutes
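As an indication of what a bit-level simulation from the outline above can look like, the following sketch estimates the bit error rate of BPSK over an additive white Gaussian noise channel and compares it with the textbook closed form. It uses only the standard library; the function names and the seeded random generator are assumptions for illustration, not the presenter's material.

```python
import math
import random


def bpsk_ber(num_bits, ebn0_db, seed=0):
    """Monte Carlo bit error rate of BPSK over an AWGN channel."""
    rng = random.Random(seed)
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = math.sqrt(1 / (2 * ebn0))  # noise std dev for unit-energy symbols
    errors = 0
    for _ in range(num_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0        # map 0 -> -1, 1 -> +1
        received = symbol + rng.gauss(0.0, sigma)
        decided = 1 if received > 0 else 0   # threshold detector
        errors += decided != bit
    return errors / num_bits


def bpsk_ber_theory(ebn0_db):
    """Closed form for coherent BPSK: 0.5 * erfc(sqrt(Eb/N0))."""
    return 0.5 * math.erfc(math.sqrt(10 ** (ebn0_db / 10)))
```

Sweeping `ebn0_db` and plotting both functions on a logarithmic axis gives the familiar waterfall curve, a quick check that the simulated chain behaves as theory predicts.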
Possibly provide any material or similar material you plan to use.
-- I am attaching the same. This is a rough draft. The one I will use for the workshop will be built on this.
A small paragraph about you with an emphasis on your experience in the area and teaching this material. Essentially, why are you well suited to teach this workshop?
-- I have conducted this workshop for ECE engineers in a couple of engineering colleges in Chennai and have received a fairly good response. I have been working in this area for the last 15+ years and am passionate about it.
-- Regarding my presentation skills, I recently gave a talk on applications of Bayes' theorem. It is at https://www.youtube.com/watch?v=xL3c3xoIXA4
|Robotics programming with rospy||Workshop on robotics programming with rospy|
Rospy is a pure Python client library for the Robot Operating System (ROS), which is itself an open source framework with a collection of tools, libraries, and conventions for writing robot software. rospy helps to quickly prototype and better understand robotics fundamentals, concepts and algorithms.
Intended Audience => Anyone with prior programming experience in Python. Participants must have a system connected to the internet. It is good if they have ROS pre-installed in an Ubuntu environment; otherwise they can use ROS Development Studio in a web browser.
Workshop Outline =>
- Introduction to robotics programming (5 Min)
* How do robots work and how are they programmed?
* Why is there a need for a common platform for robotics programming?
- Introduction to ROS (10 Min)
* Where ROS fits in robotics ecosystem.
* Brief on ROS History and versions.
* Comparison of roscpp & rospy
- Why one shall learn ROS and rospy (5 Min)
* Features of ROS
* Existing industrial and research applications of ROS
* What all you can do with ROS and also what you can't do with it.
- Explaining all the tools to be used (10 Min)
* ROS Development studio (RDS)
- Understanding ROS Architecture and Mechanism (30 Min)
* CLI Commands
* File System
* Messages, DataTypes
* Topic, Service, Action
(All these ROS concepts will be explained with hands-on examples)
- Merging it all together (40 Min)
Putting all the concepts learned so far to build a Line Follower Robot.
* Acquiring feed from Camera
* Identify line in camera feed
* Steer robot to stay in line
(All functionalities will be programmed in python scripts using rospy)
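The "identify line" and "steer robot" steps can be sketched without ROS as a proportional controller acting on one grayscale image row. This is a simplified assumption about how such a follower might work, not the workshop's code: the function names, the threshold and the gain are hypothetical, and in a rospy node the returned value would typically be published as the angular velocity of a velocity command.

```python
def line_centroid(row, threshold=128):
    """Mean column index of dark pixels in one grayscale image row, or None."""
    cols = [i for i, value in enumerate(row) if value < threshold]
    if not cols:
        return None
    return sum(cols) / len(cols)


def steering_command(row, gain=0.01, threshold=128):
    """Proportional controller: angular velocity from the line's offset.

    Positive output means turn left (counter-clockwise), which is needed
    when the line sits left of the image centre.
    """
    centroid = line_centroid(row, threshold)
    if centroid is None:
        return 0.0  # line lost: hypothetical choice, keep going straight
    centre = (len(row) - 1) / 2
    return gain * (centre - centroid)
```

A real follower would usually average several rows near the bottom of the image and limit the command, but the error-times-gain structure stays the same.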
- Where shall you go next ? (5 Min)
* ROS Resources - Books, Websites, Blogs etc.
* What all you can build with rospy. (project inspiration)
- Kahoot Quiz & Doubt solving(15 Min)
A quick quiz on the Kahoot platform to revise the concepts learned, followed by a doubt-solving session
By the end of the workshop, the audience will have a basic understanding of robot programming with rospy and a detailed knowledge of the ROS architecture. Using the concepts learned in the workshop, they will be able to write software to control robots.
|Applied Machine Learning in Python using scikit-learn, mlxtend and pandas||“A baby learns to crawl, walk and then run. We are in the crawling stage when it comes to applying machine learning.”|
With the advent of Deep Learning algorithms a decade back, the field of data science and machine learning has witnessed renewed zeal and enthusiasm. Today, every firm is eager to hire a data scientist who can derive value out of the data, but the key question is - Where should I begin? Various industry leaders are deploying deep learning models, should I do the same? Is traditional machine learning still relevant in this era to solve my business problem?
In this workshop we will address these questions and take a deep dive into applying some of the most widely used traditional machine learning algorithms to real-life use cases. We will use the open source libraries scikit-learn, pandas and mlxtend for this purpose.
The key steps we will employ to tackle each problem are:
1) Understanding the algorithm
2) Importing the data
3) Data wrangling using pandas
4) Machine learning model development using scikit-learn/mlxtend
5) Model performance evaluation
Each exercise will employ a jupyter notebook based learning environment.
The workshop session will be divided as follows:
1) Introduction to Machine Learning - 5 mins
2) Why traditional machine learning is still relevant! - 5 mins
3) Exercise #1: Real Estate Valuation using Regression Algorithm (OLS) - 15 mins
4) Exercise #2: Market Basket Analysis using Association Rule Learning Algorithm (Apriori) - 20 mins
5) Exercise #3: Credit Risk Analysis using Instance-based Algorithm (kNN) - 15 mins
6) Exercise #4: Macroeconomic Analysis of Countries using Clustering Algorithm (k-Means) - 20 mins
7) Exercise #5: Credit Risk Analysis using Bayesian Algorithm (Naive Bayes) - 15 mins
8) Exercise #6: Credit Risk Analysis using Decision Tree Algorithm (CART) - 20 mins
9) Closing Remarks and Q&A - 5 mins
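For the market basket exercise, the two quantities that association rule learning is built on, support and confidence, can be computed in a few lines of plain Python. The sketch below is only meant to show the definitions; the workshop itself names mlxtend for the Apriori algorithm, and the function names here are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent union consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)
```

For example, with baskets `[["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"], ["milk"]]`, the rule bread -> milk has support 0.5 and confidence 2/3. Apriori then prunes the search by only extending itemsets whose support clears a chosen minimum.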
- Technical: Basic Python Programming
- Software: Python 3.6+
Please install the following python packages -
pip install scikit-learn pandas mlxtend
|Application of Machine Learning in Medical Image Analysis||Machine learning algorithms have potential applications in nearly all fields of medical science and health care, such as drug discovery, assessing and diagnosing diseases, and radiotherapy. This can have a significant positive impact on human lives.|
In recent years, medical records have been increasingly digitized. Medical images are a core part of a patient's electronic health record (EHR). Currently, images are mainly examined manually by radiologists to detect diseases. However, this has limitations such as subjectivity, large variations across radiologists, and a huge workload that leads to human error and limits scalability.
Machine learning algorithms have shown huge success at computer vision tasks. This motivates applying machine learning techniques to medical images to assess and diagnose diseases.
The workshop will focus on the analysis of medical images to detect disease using Python libraries such as OpenCV, scikit-learn and Keras.
Participants attending this workshop will get exposure to the application of machine learning to medical image analysis. They may be motivated to carry out further research leading to the design of automated, low-cost, scalable and accurate medical image analysis systems.
After completing this workshop participants should be able to
a) enhance their understanding of Machine Learning in general.
b) appreciate the huge potential of applying machine learning in healthcare
c) get an overview of various types of medical images used to assess and diagnose diseases
d) understand typical Computer Vision Tasks (like Image detection, recognition, segmentation, registration)
e) get an overview of python libraries like “numpy”, “scikit-learn”, “keras”, “tensorflow”, “OpenCV”
f) understand moderately complex python code written using the above libraries
Outline of workshop
Brief Overview of Machine Learning (Supervised, Unsupervised, Deep Learning) (10 mins.)
Application of Machine Learning in Healthcare (10 mins)
• Disease Diagnosis like Non-invasive diagnosis of cardiovascular disease,
• Analysis of Medical Images,
• Personalized Medicine,
• Improving radiation therapy in cancer treatment
Computer Vision Tasks (10 mins)
• Image detection,
Types of Medical Images (15 mins)
• CT Scans
• MRI Scans
• Retinal Images
Case Study: Machine Learning Based Detection of Diabetic Retinopathy (60 mins.)
• Fundus (Retinal) Image Representation using "Bag-of-visual words model" (OpenCV API’s to be used)
• Classification of Retinopathy using SVM or Logistic Regression (scikit-learn API’s to be used)
• Alternative approach using Neural Network (Keras API’s to be used)
Assignment/Code Walk through (15 mins)
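The bag-of-visual-words representation used in the case study reduces each image to a histogram over a learned codebook. The sketch below shows only that histogram step in plain Python, under the assumption that local descriptors and a codebook (for example from k-means) are already available; the function names are hypothetical, and the workshop states that the OpenCV and scikit-learn APIs would be used in practice.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook vector closest to descriptor (squared L2)."""
    def dist2(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def bovw_histogram(descriptors, codebook):
    """Normalised visual-word histogram for one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)
```

The resulting fixed-length vector is what a classifier such as an SVM or logistic regression would take as input, regardless of how many descriptors each image produced.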
|KuttyPy: Learn Microcontrollers with Python||Extending programming skills beyond the confines of a computer to real-world tasks such as |
controlling the lights and fans in a room, or measuring environmental conditions, requires the use of
microcontrollers, for which programs must be compiled and uploaded.
KuttyPy is a USB interfaced board that gives you an option to study microcontroller register internals in pure Python.
It skips the compile+upload requirement, and offers a 'live' link for reading and writing the special function
registers(SFRs) of the ATMEGA32 microcontroller used in the hardware via either the PyQt based intuitive
graphical interface, or the Python library, without any need for compiling or uploading code to the device.
The graphical utility is designed to resemble the kuttyPy development board in appearance,
and the REGISTER manipulations carried out are explicitly shown to the user, thereby enabling an easy migration
to writing C code for embedded applications.
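The register manipulations described above come down to setting, clearing and toggling individual bits. The sketch below imitates this on an in-memory dictionary so it can run without hardware; the dictionary and helper names are hypothetical stand-ins and are not the KuttyPy API, although `DDRB` and `PORTB` are real ATmega32 register names.

```python
# Hypothetical in-memory stand-in for two 8-bit ATmega32 registers; on real
# hardware the KuttyPy library would perform these reads and writes over USB.
registers = {"DDRB": 0x00, "PORTB": 0x00}


def set_bit(name, bit):
    """Set one bit, leaving the others unchanged."""
    registers[name] |= (1 << bit)


def clear_bit(name, bit):
    """Clear one bit, keeping the value within 8 bits."""
    registers[name] &= ~(1 << bit) & 0xFF


def toggle_bit(name, bit):
    """Flip one bit."""
    registers[name] ^= (1 << bit)


set_bit("DDRB", 3)      # configure pin PB3 as an output
set_bit("PORTB", 3)     # drive PB3 high (for example, LED on)
toggle_bit("PORTB", 3)  # flip PB3 back to low
```

The same three operations map directly onto the `|=`, `&= ~` and `^=` idioms used in C code for AVR microcontrollers, which is the migration path the abstract describes.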
Please Refer to READ THE DOCS for screencasts, code examples, and API ref: https://kuttypy.readthedocs.io/en/latest/
# Use case as a monitoring tool
The Hardware has 32 I/O pins, all of which can be controlled directly from Python.
Some of these pins have 10-bit ADC functionality, and the Python software displays their
values using analog gauge widgets, and also has the option to plot using matplotlib/pyqtgraph .
There are also built-in analytical tools for fitting signals using scipy's leastsq method.
# Using Plug and Play I2C sensors
Communication protocols for around 5-6 commonly available sensors for pressure, temperature,
acceleration, luminosity etc have been coded into the kuttyPy library, and users can use it as a
bridge for reading values from these with a single function call.
The software is capable of detecting plugged in sensors, and showing their values with a fast refresh
rate. List of supported sensors, and animations are in Read The Docs: https://kuttypy.readthedocs.io/en/latest/sensors/
The programming API has been carefully documented with examples: https://kuttypy.readthedocs.io/en/latest/programming/i2c/
# Using as a regular ATMEGA32 development board
The software is also capable of invoking avr-gcc to compile C code, and upload to
spare flash memory in the KuttyPy hardware in order to make standalone applications.
Doing so will not overwrite the KuttyPy firmware because it is a part of the bootloader.
The workshop caters to a beginner audience interested in microcontrollers. Participants should be familiar with the use of Python functions.
I will bring the development boards (please inform me of the expected audience number) and a set of accessories for participants to use.
+ Introduction to binary, and its relevance in microcontrollers
+ using the GUI, and understanding REGISTER manipulations
+ Writing python code to blink LEDs and read ADC values
+ Using Matplotlib to plot ADC values in 'real-time'
+ Using I2C sensor MPU6050 to study the motion of a pendulum with Python code.
+ Writing a simple C program, compiling and uploading it
Software is available for Ubuntu 18.04+ (deb file / source code). It is also in Debian, thanks to maintainer Georges Khaznadar, but
the most recent changes may not be present, so I would prefer to supply an unofficial deb with the latest code.
Software is also available for Windows (PyInstaller + Inno Setup).
Dependencies: python3-serial, python3-pyqt5, python3-pyqt5.qtsvg, gcc-avr, avr-libc, python3-qtconsole, python3-scipy, python3-pyqtgraph, python3-matplotlib
Source Code Link: github.com/csparkresearch/kuttypy-gui
Hardware resources: http://expeyes.in/kuttypy/index.html
Documentation resources will primarily be from READTHEDOCS: https://kuttypy.readthedocs.io/en/latest/
pdf manual can be exported from this as well.
Acknowledgement: This work was built upon KuttyPy from the ExpEYES project (http://expeyes.in/kuttypy/index.html) which
had the basic idea of reading and writing registers using Python. We thank the original authors for open sourcing the hardware designs.
We also thank the authors of Optiboot for making the bootloader protocol freely available
|Building ML Applications with Gramex||Data application building is a hard and time-consuming process. The inability to build data-driven applications causes most data science projects to fail the production deployment test. |
Gramex is an open-source data science framework that enables both developers and data scientists to convert their Python projects into web applications. Gramex simplifies the application building process via a low-code, configuration-based approach.
In this workshop, participants will learn how to use various Gramex components to:
Connect apps to filesystem and DB sources, and manipulate the flow of data through REST APIs,
Train sklearn models on the data and expose the trained models on the frontend,
Use the data and the trained models to create interactive dashboards via configuration
Gramex abstracts much of the data interfacing and ETL processes that are typically required by dashboards, converting these processes from Python code into YAML configuration. Further, gramex components allow arbitrary Python code to run as REST APIs. This makes backend engineering easy.
Users have reported from 60 to 80% reduction in coding effort with Gramex. It also bundles a JS library which orchestrates frontend interactivity between multiple components.
The goal of the tutorial is to enable Python developers to build web applications without having to worry too much about implementation details, MVC frameworks, databases, and filesystems. As a humble, low entry barrier alternative to Django or Flask, Gramex enables users to build a web-based portfolio for their data-driven applications.
Gramex - https://learn.gramener.com/guide/
Pypi - https://pypi.org/project/gramex/
|Calculus with Python||Python, being easy to read and learn, is a wonderful learning aid and a good tool for learning calculus and its real-world applications. This workshop goes through the use of calculus in three simulation examples and aims to give a basic introduction and learning pathways for numerical modeling. The first example is a taxi trip simulation, the second covers trajectory models and the third is a simulation using the Pyclaw library. We will begin with introductory exercises on the basics of calculus and the concepts involved in each case.|
Calculus basics with Python (50 minutes), [Notebook](https://github.com/nishadhka/Learning-Calculus-with-Python/blob/master/introduction/Calculus-Differentiation-Integration.ipynb)
Solving differential equations
Integration of simple function
Some interesting Numerical simulations
Extras: 1D-Wave equation simulation
Extras: Large Eddy Simulation of Ocean (in Julia)
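The three basic operations in the list above (differentiation, integration and solving a differential equation) each have a simple numerical counterpart. The following standard-library sketch is an illustration and is not taken from the linked notebook; the function names, step sizes and choice of methods are assumptions.

```python
import math


def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def integrate(f, a, b, n=1000):
    """Composite trapezoidal rule on n equal sub-intervals."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


def euler(f, y0, t0, t1, n=1000):
    """Forward Euler solution of dy/dt = f(t, y) at t1, starting from y(t0) = y0."""
    step = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y += step * f(t, y)
        t += step
    return y
```

For instance, `integrate(math.sin, 0, math.pi)` is close to 2, and `euler(lambda t, y: -y, 1.0, 0.0, 1.0)` approaches `math.exp(-1)` as `n` grows.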
Taxi trip simulation (15 minutes): Taxi trip data gives the origin, destination and duration of a trip. The real-world movement of the taxi will be simulated using Uber taxi trip data and OpenStreetMap data, with Bangalore city as the simulation domain.
Trajectory models (15 minutes): The PUFF model is a Lagrangian particle model which simulates the trajectory of particles emitted from volcanic eruptions. The model will be run on case study data from a volcanic eruption.
Pyclaw library (30 minutes): Pyclaw is a Python-based solver for partial differential equations. The library is used for various geophysical simulations; one of its examples will be demonstrated.
|HDR imaging using Deep Learning||Common digital cameras cannot capture the wide range of light intensity levels in a natural scene. This can lead to a loss of pixel information in under-exposed and over-exposed regions of an image, resulting in a low dynamic range (LDR) image. To recover the lost information and represent the wide range of illuminance in an image, high dynamic range (HDR) images need to be generated. There has been active research in deep learning for HDR imaging. Advances in deep learning for image processing tasks have paved the way for various approaches to HDR image reconstruction using feed-forward convolutional neural network (CNN) architectures.|
To better utilize the power of CNNs, we exploit the idea of feedback, where the initial low-level features are guided by the high-level features using the hidden state of a recurrent neural network. Unlike the single forward pass of a conventional feed-forward network, the reconstruction from LDR to HDR in a feedback network is learned over multiple iterations. This enables us to create a coarse-to-fine representation, leading to an improved reconstruction at every iteration. Advantages over standard feed-forward networks include early reconstruction ability and better reconstruction quality with fewer network parameters. We design a dense feedback block and propose an end-to-end feedback network, FHDR, for HDR image generation from a single-exposure LDR image.
Qualitative and quantitative evaluations show the superiority of our approach over the state-of-the-art methods. This approach can also be extended to HDR video generation by propagating temporal coherency throughout the network using recursive, attention-based mechanisms and exploiting regularisation terms that can infuse such temporal consistency in the way the network is trained.
Some more information about this can be found here - https://mukulkhanna.github.io/projects/FHDR
|Analysis of Electronic circuits using PySpice and Scipy||An important step in fabricating any electronic subsystem is to simulate the schematic model of the circuit and analyze how it will behave in the real world. Many electronic circuit simulators are available on the internet, but few of them give the flexibility and scope to analyze the data acquired from the simulator. PySpice is a free and open-source Python module which interfaces Python with the Ngspice and Xyce circuit simulators. PySpice implements an Ngspice binding and provides an object-oriented API on top of SPICE; the simulation output is converted to NumPy arrays for convenience.|
Any custom netlist created by the user can be loaded from source code and simulated to get the data required for analysis.
A frequency response graph plots frequency on the X-axis and response (in decibels) on the Y-axis. It shows how much a signal of a given frequency is amplified or attenuated. In many RF applications, a 3 dB drop in signal is tolerated (it corresponds to the output power falling to half, that is, the output voltage falling to about 0.707 of its passband value). Hence, finding the frequency at which this happens plays a vital role in circuit analysis. This is carried out using SciPy and other simple mathematical methods of approximation.
A function is defined and modeled using ‘scipy.optimize’: a curve is fitted to the existing data points, and the frequency at which the 3 dB drop occurs is estimated using binary search and approximation methods. This is handy when complex circuits are designed and need to be studied before they are built or produced at large scale.
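The binary-search step can be illustrated on a response whose answer is known in closed form. The sketch below uses the analytic magnitude response of a first-order RC low-pass filter in place of a curve fitted to simulator output, which is an assumption made to keep the example self-contained; the function names and default tolerance are likewise illustrative.

```python
import math


def rc_gain_db(freq_hz, r_ohm, c_farad):
    """Magnitude response of a first-order RC low-pass filter, in dB."""
    w_rc = 2 * math.pi * freq_hz * r_ohm * c_farad
    return 20 * math.log10(1 / math.sqrt(1 + w_rc ** 2))


def find_cutoff(gain_db, f_lo, f_hi, drop_db=10 * math.log10(2), rel_tol=1e-9):
    """Bisection for the frequency where a falling response is drop_db below gain_db(f_lo).

    Assumes gain_db decreases monotonically between f_lo and f_hi.
    """
    target = gain_db(f_lo) - drop_db
    while (f_hi - f_lo) > rel_tol * f_hi:
        f_mid = math.sqrt(f_lo * f_hi)  # geometric midpoint suits a log-frequency axis
        if gain_db(f_mid) > target:
            f_lo = f_mid
        else:
            f_hi = f_mid
    return 0.5 * (f_lo + f_hi)
```

With R = 1 kΩ and C = 1 µF, `find_cutoff(lambda f: rc_gain_db(f, 1e3, 1e-6), 1.0, 1e6)` lands close to the theoretical 1/(2πRC) ≈ 159.15 Hz. For simulated data, `gain_db` would instead be the fitted or interpolated response.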
The Python libraries used include NumPy, SciPy, PySpice and Matplotlib, among others. The same analysis can be applied to existing data points from an external simulator, loaded with suitable data handling tools such as Pandas or the JSON module. In this way, the frequency response of any R-C or op-amp based filter or modulator circuit can be studied and analyzed, even under external factors such as temperature that affect the operating conditions.
|Python application for road safety: Pothole Detection, Visualization and Updating the Android Navigation Map||Potholes are one of the major causes of road accidents and deaths, as well as wear and tear of vehicles. Various techniques have been used to address this problem, ranging from vibration-based sensors to 3D reconstruction using laser imaging to thermal imaging. However, all these techniques have drawbacks, such as high setup cost, high risk during detection, and the high computational power required for visualization. The objective of this work is to analyze the accuracy and feasibility of pothole detection in the field using a self-built convolutional neural network (CNN) implemented with the TensorFlow Python library, and to represent the detections on a custom-built navigation map using the framework provided by Android. Images of potholes under various conditions and weather were collected from the Kaggle website. Augmentation techniques such as image zooming and image skewing were then applied, and the CNN was trained with TensorFlow for pothole image recognition. The trained CNN model was then added to the map's camera for full functioning.|
Additionally, car detection and lane detection CNNs were built using TensorFlow, and their output was visualized on the camera screen using OpenCV. To show the structure of the potholes detected on the road by the CNN, we use OpenCV and common image processing techniques. The custom-built Android map is enriched with a navigation feature, and a self-built anomaly detector and loader was customized to include the locations of potholes detected by the CNN and to represent them on the map with a band of colors ranging from pothole-free road to abysmal road, for future convenience and reference. The results of this work will help guide developers of navigation systems in incorporating pothole and road defect detection and visualization into navigation maps.
|Python in High Energy Physics||High Energy Physics is the study of the most fundamental constituents of matter and how these elementary particles interact. Often synonymous with Particle Physics, High Energy Physics seeks to uncover the secrets of the Universe; one of its recent major discoveries is the Higgs boson, which confirmed the Standard Model describing how elementary particles interact through the fundamental forces. High Energy Physics is probably the physics sub-field that has adopted Python most rapidly, second only to Astrophysics. The talk starts with a look at what computing in High Energy Physics has looked like in the past and how many physicists played major roles in the development of Computer Science. It then explores the emergence of Python as the language of choice for many physicists and two of the major libraries that have been vital to the adoption of Python in the High Energy Physics community: cppyy and uproot. These are especially important since they demonstrate the different ways one could approach shifting the High Energy Physics community from C++ to Python successfully. The talk will review where and how Python is used in the High Energy Physics community and what its use is expected to look like in the future. High Energy Physics has its own Python toolkit, Scikit-HEP, which comes with a set of Python libraries for use by physicists. The Scikit-HEP project is a community-driven and community-oriented project that aims to provide Particle Physics at large with an ecosystem for data analysis in Python. It is also about improving the interoperability between High Energy Physics tools and the scientific Python ecosystem. This year is ideal for this talk: according to some available data, it is the year in which Python usage overtakes C++ usage in several High Energy Physics experiments at CERN. As some physicists have dubbed it, this is the year of Python in High Energy Physics.|
|Pyspark based Scalable Machine Learning Algorithm for Handling Soybean Genome Sequences using Biopython||A major challenge in analyzing data from high-throughput genomics is dealing with the huge amount of data using a variety of traditional tools. Biopython is a distributed, cooperative effort to develop Python libraries and applications that address the needs of present and future work in bioinformatics. Biopython contains a number of sub-modules for common bioinformatics tasks such as handling DNA, RNA and protein sequences, reverse complementing a DNA string, and generating phylogenetic trees. It was developed by Chapman and Chang and is written mainly in Python. It provides many parsers for reading major bioinformatics databases and file formats such as GenBank, Swiss-Prot and FASTA inside the Python environment. We use the Biopython library to generate phylogenetic trees of DNA sequences. Genome sequencing has grown at a rate faster than that predicted by Moore's law. Data sets are becoming so large that they require big data processing technology. Genomic variation and population genetics are among the areas that can be framed as clustering problems. This area poses many challenges, such as cluster analysis, feature extraction, and genome clustering. Another challenge with genome data is its volume: the sequences are too voluminous to be processed by a single system. To support investigations in bioinformatics, specifically on genomic variation and population genetics, we have implemented a feature extraction approach for converting DNA sequences into feature vectors. The main aim is to process the data so that it can be provided as input to a machine learning algorithm. The dataset, which comprises DNA sequences of soybean plants, is preprocessed and reduced to vectors of twelve features. 
Soybean diseases are one of the important factors with a direct impact on global soybean productivity, and climate change will further aggravate the situation. One solution is clustering of genome sequences, which helps to form logical groups of plants that have similar issues in their DNA. We implement scalable kernel fuzzy clustering algorithms on Apache Spark, which perform efficient clustering of Big Data. Our scalable kernel fuzzy clustering algorithm offers better potential for handling genomics data. |
Apache Spark is a fast, in-memory cluster computing framework. Though it was originally written in Scala, its open source community has developed PySpark to support Python for Spark. PySpark gives data scientists a link between Apache Spark and Python through the Py4J library. In short, PySpark is the Python API for Spark, and its widespread popularity reflects how efficient and powerful it is. Spark has been reported to run up to 100x faster than traditional large-scale data processing frameworks for in-memory workloads. PySpark gives its users benefits such as in-memory processing, low latency, compatibility with various languages, suitability for huge datasets, powerful caching, good disk persistence, and support for RDDs. Python is not the only programming language that can be used with Apache Spark. Data science beginners may prefer Spark, but choosing which language to use with it is their next major decision. There are many reasons to prefer PySpark over the combination of Scala and Spark: Python is known for its simple syntax, easy interface, abundance of ML, DS and CV libraries, and readable, maintainable code, and the list goes on.
Unsupervised learning on genome data requires huge computational power and memory. Apache Spark with python serves the needs exactly.
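As a small illustration of two steps mentioned in this abstract, reverse complementing a DNA string and turning a sequence into a numeric feature vector, the sketch below uses plain Python. It is not the authors' twelve-feature extraction, whose details are not given here; the base-frequency features and function names are assumptions, and Biopython provides its own sequence objects for the first task.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def base_frequencies(seq):
    """Relative frequency of A, C, G, T as a 4-element feature vector."""
    seq = seq.upper()
    return [seq.count(base) / len(seq) for base in "ACGT"]
```

Fixed-length vectors of this kind are what a clustering algorithm consumes; in a Spark setting the same function could be mapped over a distributed collection of sequences.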
|Rospy - Extending robotics education and research||Robotics has been seen as an emerging field as well a common integration in many technologies and processes. Several industries have included mechanical entities in order to make their tasks easier as well as to cover up the work that was only a limitation to humans earlier. Apart from commercialization, robots have also been used for several educational and research purposes and the language used extensively with that is Python.|
Robotics and Python have come a long way together. Even today, most scripting and code building for robotic applications is done in Python. It has several frameworks and libraries that provide an environment for easy integration of software with hardware. Among these, the Robot Operating System, better known as ROS, has proven to be an efficient way to do so.
Robot Operating System (ROS) is an open source framework for writing robot software. It organizes an application into modules and synchronizes them to perform a particular robotic task. It can be programmed in multiple languages, and the three core client libraries available with ROS are roscpp, rospy and roslisp. We chose rospy because it lets us quickly prototype various algorithms and because it is well maintained, documented, stable and open source.
Rospy is a pure Python client library for Robot Operating System (ROS), which itself is a collection of tools, libraries, and conventions for writing robot software. Additionally, it can be used with various other tools, such as:
- OpenCV (computer vision library)
- Gazebo (simulator with dynamic and kinematic physics)
- Fetch (ROS-compatible robot)
- rviz (sensor data visualisation tool)
- roslink (protocol to integrate robots with IoT)
- MoveIt (motion planning library)
All these tools integrate well with rospy and extend its capabilities.
We used rospy to simulate the Fetch robot inside a stockroom environment and observed the following advantages:
- It can make robotics education and experimentation easy and quick.
- Educators can teach robotics concepts and algorithms, from easy to complex, using only software, which makes teaching much more economical.
- Using 2D and 3D visualization in rospy, we can get an almost real-world response without an actual robot.
- Many pre-built ROS-compatible robots are already available to work on using rospy.
- It can be used for industrial and scientific research as well. Since even a simple robot can have many complex components embedded within it, rospy makes it possible to work on just one specific robot component.
- It accelerates research, as libraries for common robotic tasks are pre-built, enabling industries and scientists to run multiple tests even before hardware implementation.
What inspires us to present this framework to a larger audience is its capability to take robotics to the next level. Where robotics earlier meant setting up electronic components and fixing circuits alongside complex coding, with the help of ROS a project can be simulated beforehand, which saves time. Recognizing the industrial use, the framework itself has a module for that explicit purpose (https://rosindustrial.org/about/description), giving enthusiasts an even wider view.
Some other real-world examples that use ROS as a framework are:
- Webots, a free and open-source 3D robot simulator used in industry, education and research. (https://en.wikipedia.org/wiki/Webots)
- Husky, a medium sized robotic development platform (https://clearpathrobotics.com/husky-unmanned-ground-vehicle-robot)
- HERB, developed at Carnegie Mellon University in Intel's personal robotics program
- Raven II Surgical Robotic Research Platform
Although rospy favours implementation speed at the cost of runtime performance, we find it worth learning. As people who love to explore robotics with Python, we would love to share the knowledge we have gathered in order to build a bigger community.
|Storing a few versions of a 5GB file in a data science project||Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists feel the lack of engineering practices such as versioning large datasets and ML models. Duplicating files and renaming them after every change isn’t an efficient practice. This lack is particularly acute for engineers who have just moved to the ML space.|
Versioning small files has been quite feasible ever since Git came into existence. Operations like staging, pushing and checking out files can be done in milliseconds. But it becomes a hectic task when it comes to handling large data files and models, which are really common in data science projects. Even designers sometimes need to share large illustrations (Photoshop or Illustrator files) and collaborate on different aspects of them. Git isn’t sufficient here because of the way it versions files. It uses a directed acyclic graph (DAG) which stores references to blobs of tracked files. So when a large file is modified frequently, a new copy of it is stored for each version. This takes up huge disk space and makes Git a very inefficient way to version a data science project.
A solution for versioning large files does exist, called Git LFS, but it has a few limitations, such as being supported on only a few cloud storage services (like Bitbucket Cloud). The main purpose of Git LFS was to track the assets of a project that are a supplement to it. In a data science project, the large files, which are datasets, are an integral part of the project and are wired up with the source code to produce models. For example, a dataset is divided into two more files for training and testing the model.
With Data Version Control (DVC), we aim to solve the problem of versioning large files effectively in a data science project, or in any project with huge assets. The files tracked by DVC are checked out within a very short period of time (unlike in Git, which may take hours). Since DVC integrates easily with Git and has Git-like commands, a software engineer doesn’t have to learn a whole new ecosystem. In fact, all the source code (small files) can be pushed to a service like GitHub, all data files can be stored on separate cloud storage (S3, GCS, or a bare-metal SSH server), and DVC is enough to manage them both. In this way, DVC completely segregates data versioning from source code versioning.
Data versioning isn't the same as source code versioning, since data files are huge in size. DVC helps a software engineer who is used to Git and versioning to survive in ML projects. It is a simple command-line tool that needs no infrastructure, so the only prerequisite is knowing how to version source code.
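The separation of data from source code described above can be illustrated with a toy content-addressed cache. The large file is stored once per distinct content under its hash, while a tiny pointer file (small enough to commit to Git) records which content belongs in the working tree. This is a sketch of the general idea only, not DVC's implementation: the `.pointer` suffix, the JSON layout and the `track`/`checkout` helpers are hypothetical, and DVC's real `.dvc` files and cache layout differ.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def track(data_file: Path, cache_dir: Path) -> Path:
    """Store the file in the cache under its hash and write a small pointer file."""
    md5 = file_md5(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / md5
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer")  # hypothetical suffix
    pointer.write_text(json.dumps({"md5": md5, "path": data_file.name}))
    return pointer

def checkout(pointer: Path, cache_dir: Path) -> None:
    """Restore the data file that the pointer file refers to."""
    meta = json.loads(pointer.read_text())
    shutil.copy2(cache_dir / meta["md5"], pointer.with_name(meta["path"]))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache, data = root / "cache", root / "data.csv"
    data.write_text("a,b\n1,2\n")
    pointer = track(data, cache)         # version 1
    v1_pointer = pointer.read_text()     # what Git would have committed
    data.write_text("a,b\n1,2\n3,4\n")
    track(data, cache)                   # version 2
    pointer.write_text(v1_pointer)       # like checking out the older commit
    checkout(pointer, cache)
    restored = data.read_text()
    n_cached = len(list(cache.iterdir()))
```

Only the pointer file changes between commits, so the Git history stays small no matter how large the dataset is.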
An outline of the talk:
2. Problems in versioning large files
3. Drawbacks of Git when used to version big datasets
4. Why Git LFS may not be enough to manage big files specifically for data science
6. A simple demo to showcase DVC’s efficiency in versioning large files
7. Explain the workflow of DVC and how it easily integrates with Git for version control
8. Encouragement to contribute
Work-in-progress slides of the talk: https://docs.google.com/presentation/d/1tSpvslUS3RCfA66u-rXhHbnO5uQpcs_pnDfSAa5W-fI/edit?usp=sharing
An introductory video for the talk: https://drive.google.com/open?id=1-CGxSJ3Qw4b3AS4vkAe3iWTawP4x3IvJ
|KDAP: An Open Source Toolkit to Accelerate Knowledge Building Research||With the success of crowdsourced portals like Wikipedia, Stack Overflow, Quora, GitHub, etc., a class of researchers is driven towards understanding the dynamics of knowledge building on these portals. Besides providing information using the collective intelligence of the crowd, these portals are unique in that users exercise diverse roles (like editing, voting, commenting, etc.), leading to knowledge maturity. Although collaborative knowledge building portals are known to be better than expert-driven knowledge repositories, limited research has been performed to understand the knowledge-building dynamics in the former. This is mainly due to two reasons: first, the unavailability of a standard data representation format; second, the lack of proper tools and libraries to analyze the knowledge-building dynamics.|
We describe Knowledge Data Analysis and Processing Platform (KDAP), a programming toolkit that is easy to use and provides high-level operations for analysis of knowledge data. We propose Knowledge Markup Language (Knol-ML), a standard representation format for the data of collaborative knowledge building portals. KDAP can process the massive data of crowdsourced portals like Wikipedia and Stack Overflow efficiently. As a part of this toolkit, a data-dump of various collaborative knowledge building portals is published in Knol-ML format. The combination of Knol-ML and the proposed open-source library will help the knowledge building community to perform benchmark analysis.
KDAP is developed for single big-memory, multiple-core machines and strikes a balance between maximum performance and compact in-memory data representation. The toolkit aims to provide the knowledge-building community with open access to standard datasets and efficient data analysis methods. This will make inter-portal analysis easier and reduce the overhead for researchers of learning different data formats and the corresponding APIs/libraries for analyzing various knowledge-building portals.
|Daskify an MPI application for distribution using Dask : Learnings during Implementation||Message Passing Interface (MPI) is a very popular distributed communication framework for large, scalable parallel applications. It has been around for a few decades now and has a proven track record of providing the best performance for HPC applications. There are also multiple implementations of distributed machine learning applications that use MPI for distributed training. One such implementation is IBM’s Snap machine learning (SnapML) library (https://www.zurich.ibm.com/snapml/), which has an MPI-based multi-node and multi-GPU distribution for the training of classical machine learning algorithms.|
On the other hand, it is well understood that a big chunk of data scientists' time is spent on data preparation (cleansing, pre-processing and normalization). Pandas and NumPy are the optimized Python packages that are popular for this work. There was a need to scale out these libraries beyond a single machine. Implementing that with MPI for multi-node use would have required data scientists to venture out of the comfort zone of Jupyter notebooks, and so Dask (https://dask.org) was born! Dask provides native scaling of Pandas and NumPy with a similar API. It continues to be co-developed and enhanced in the Python ecosystem.
With what existed, we could make the best of both worlds (SnapML MPI and Dask) by:
1. Performing all the data preparation in a multi-node distributed fashion using Dask DataFrame and Dask Array.
2. Converting the result into NumPy for dense data or CSR/CSC for sparse data, and dumping it to a file.
3. Using SnapML to load the data file and perform training in a distributed manner using MPI.
Even though the above steps worked as expected, we had to compromise the flow with an intermediate step of dumping data back to disk. The data then had to be re-loaded into multi-node memory by the SnapML MPI library and trained. To overcome this limitation and enable direct consumption of Dask Array, we explored what was required to Daskify the SnapML distributed implementation.
With the items listed below, and a few more, we were able to create a Dask version of distributed SnapML and make it compatible with Dask Array as input:
1. Modify the SnapML Python code to work with the dask.distributed client object and make good use of `client.submit()` and `client.map()` for launching parallel tasks on Dask workers.
2. Replace all the MPI_Allreduce() calls in C++ with NumPy calls like max(), sum() etc. in Python.
3. Modify the data loader code to understand Dask workers and rechunk the input Dask array to one partition per worker/node.
4. Use `client.has_what()` to identify the portion of data in each worker and use that information to ensure correct logical continuity of jobs on each worker.
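Item 2 can be sketched as follows, under stated assumptions: each partition produces a partial result, and a NumPy reduction over the collected partials plays the role that `MPI_Allreduce()` played before. In a Dask setting the partials would be computed on the workers and gathered by the client; here a plain list comprehension stands in for that so the block needs only NumPy. The logistic-regression gradient and the helper name `partial_gradient` are illustrative choices, not SnapML's actual code.

```python
import numpy as np

def partial_gradient(X_part, y_part, w):
    """Logistic-regression gradient contribution of one partition (a sum, not a mean)."""
    p = 1.0 / (1.0 + np.exp(-(X_part @ w)))
    return X_part.T @ (p - y_part)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = (rng.random(12) > 0.5).astype(float)
w = np.zeros(3)

# One partition per (hypothetical) worker, mirroring "one partition per worker/node".
parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

# Each worker would compute its partial result independently.
partials = [partial_gradient(Xp, yp, w) for Xp, yp in parts]

# Stand-ins for MPI_Allreduce with SUM and MAX: reduce the gathered partials.
global_grad = np.sum(partials, axis=0)
global_max = np.max([np.abs(Xp).max() for Xp, _ in parts])
```

Because the gradient is a sum over samples, reducing the per-partition sums gives the same result as computing it on the full dataset in one place.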
With this work, we were able to implement a Dask distributed version of SnapML’s LogisticRegression that works directly with Dask Arrays. This ensured that all the pre-processing, training, test and evaluation steps using Dask and SnapML could be performed in a single Jupyter notebook. Hence, all the execution steps could now be performed in a distributed manner on multiple nodes without any loss of end-user experience. We observed a slight degradation in performance compared with the existing MPI implementation, but the gain in functionality easily made up for it.
Our proposed presentation at SciPy will delve into the technical details of the modifications required in the library to achieve this. We will also talk about the challenges and share the learnings with the SciPy community.
|Clinical decision support system for heart disease prediction and multi object detection||This work covers the prediction of heart disease arrhythmias in ECG signals using Python tools. ECG is one of the best diagnostic tools for analysing heart rate. The approach segments ECG records into single ECG beats and classifies the patterns into heart diseases. The segmented beats are classified into various classes based on the labeled images present in the database repositories. BioSPPy (Biosignal Processing in Python) supports pattern recognition on ECG signals. It filters the ECG waveforms, performs R-peak detection and computes the instantaneous heart rate, which assists frequency analysis. Installation is easily done with the pip package installer ($ pip install biosppy). Another Python tool, HeartPy, aids in various analysis tasks on the heart disease dataset. It can handle very noisy ECG data and is also used for pre-processing the signals. HeartPy can read and parse data from .csv and .txt files, as well as Matlab .mat files. It also estimates the sample rate from ms-based timers and DateTime-based timers. In this research, the MIT-BIH datasets are used for heart disease classification. Different types of classifiers have been used for heart disease prediction, and their performance is visualized by plotting with the matplotlib library. HeartPy warns with an exception when the input signal is of insufficient quality, that is, when it does not contain enough information to process. Another research work in progress predicts heart disease from CT images by predicting inflammation of the fat plaques present in coronary arteries. 
A biomarker called FAI (Fat Attenuation Index) has been introduced to assess the risk factor of the disease. The prevalence rate of the disease is predicted based on the hazard ratio of the FAI, which varies from a lower to a higher cut-off. Another research work is related to object detection using darkflow. This work detects number plates, helmets, persons and other objects related to traffic hazards. The OpenCV toolkit enables visual predictions of the detected vehicles. The following are dependencies used in the research works. |
|A Novel Convolutional Neural Network Architecture for Audio Emotions Classification||Emotions are essential to communication. During the past few years, we have sought to create technology that conveys emotions through emojis and speech generation. Our research describes an experimental study on the detection of emotion from audio speech. This study uses a corpus of continuous emotional speech covering 8 different emotions: Angry, Disgust, Fear, Neutral, Happy, Sad, Calm and Surprise. We created a 1D CNN-based model that can detect emotions in speech with reasonable performance against a random baseline and a simple neural network.|
|Powering the Python Programming Laboratory||Computer Science students from Engineering, Arts and Science study programming languages as core courses. Programming assignments carried out on machines are an important educational tool. Most students find their first programming course difficult and need a lot of guidance and support from instructors. The availability of a teaching tool that facilitates monitoring and personalized guidance in a continuous learning process helps students overcome the initial difficulties. If the work is not assessed by the instructor promptly, students may not know whether their approach to programming is correct. |
The traditional way of assessment meets the evaluative aspect, but does not give students the opportunity to learn from their mistakes. Automating the Programming Labs (APL) is a multi-disciplinary approach to develop, optimize, research and capitalize on technologies in the laboratory that allow improved and new processes to retain and improve the learning potential of the students. Besides, the traditional evaluation method requires considerable time and additional effort from the teacher. For teachers, APL is a boon and a ready reckoner that frees them to perform other productive tasks.
APL greatly reduces the cycle times of lab processes, and accurate evaluation is carried out for all students. APL requires neither compilers nor IDE installation and maintenance. It needs only an internet-connected computer with a popular web browser. APL comprises two components, namely Moodle and a jail system. Both are released under the GNU GPL license.
USE CASE: Using APL, the Python Programming Laboratory (around 300 programs) has been automated in our institution since 2017. The automation is carried out by writing evaluation scripts in Python for all Python experiments. Each evaluation script evaluates the user-written programs against several test cases. In addition, APL checks the validity of function definitions and whether appropriate logic is in place, using benchmark competitive programming tools.
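A minimal sketch of what such an evaluation script might look like is given below. It is an assumption, not APL's actual code: the helper names `has_function` and `evaluate`, the test-case format and the sample submissions are all hypothetical. It first checks that the required function is defined at the top level, then runs it against each test case.

```python
import ast

def has_function(source: str, name: str) -> bool:
    """Check that the submission defines a top-level function with the given name."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(isinstance(node, ast.FunctionDef) and node.name == name
               for node in tree.body)

def evaluate(source: str, func_name: str, test_cases):
    """Return (passed, total) for a submission run against (args, expected) pairs."""
    total = len(test_cases)
    if not has_function(source, func_name):
        return 0, total
    namespace = {}
    # NOTE: exec() is unsafe for untrusted code; a real grader would run
    # submissions inside a sandbox such as the jail system mentioned above.
    exec(source, namespace)
    func = namespace[func_name]
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply counts as failed
    return passed, total

cases = [((0,), 1), ((1,), 1), ((5,), 120)]
correct = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n"
wrong = "def factorial(n):\n    return n\n"
missing = "def fact(n):\n    return 1\n"

score_correct = evaluate(correct, "factorial", cases)
score_wrong = evaluate(wrong, "factorial", cases)
score_missing = evaluate(missing, "factorial", cases)
```

Returning a score rather than a pass/fail flag lets the grader give partial credit and tell the student how many cases failed.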
All programming and other practical courses in our (CSE) department are evaluated using APL. This has had a significant influence on students' learning style and logical thinking, and a major impact on their placement. It has created an opportunity for our students to secure positions at well-known core companies with attractive emoluments.
|Fun experiments on automating office workflows in Windows with Python||Many industry and research professionals around the world use Microsoft Windows as one of their operating systems, along with its Office suite. In my previous job, I played around with the win32* packages and developed a couple of fun Python scripts to automate some day-to-day jobs. I did this primarily because it was a lot of fun, but as a side benefit, it saved me time every day. Examples include automatic parsing of emails, alerts, and logs in Outlook; automatic replies to chat messages received over Office Communicator; sending a meeting request over the calendar without touching Outlook; multiple application logins with a single click; and automating decisions based on what is happening on the computer screen.|
In this talk, I shall present four such workflows in detail: how they work and what they can be used for. I shall talk about parsing, navigating, and sending emails and meeting requests without touching Outlook, along with automatic replies to chat messages on Communicator and making decisions based on visual events on screen. Combined with the power of easy-to-use machine learning frameworks in Python, we can build some fantastic pipelines for routine office tasks. This talk is suitable for attendees from industry and academia and can help them save time and build efficient day-to-day workflows for repetitive tasks. Although the background applications here are proprietary, all the tools we shall use are open source, and so shall be the spirit of the talk. The broad ideas and tools presented in this talk can, in general, be extended to automation with any application running on Microsoft Windows.