Video hardware and software for the International Brain Laboratory
Guido Meijer1, Michael Schartner2, Gaëlle Chapuis3, Karel Svoboda4 & IBL Collaboration
1Champalimaud Centre for the Unknown, Lisbon, Portugal; 2University of Geneva, Switzerland; 3University College London, United Kingdom; 4HHMI / Janelia Farm Research Campus, Ashburn, VA, USA
Movement is a critical aspect of behavior and correlates with neural dynamics. Neural dynamics control movements, and movements in turn produce sensory input that modulates neural dynamics. It is therefore important to monitor movements in great detail. Using three cameras, we tracked both sides of the face and the trunk. We evaluated DeepLabCut, a toolbox for markerless tracking based on deep learning, for tracking of body parts of mice in the IBL task. We found that DeepLabCut can largely replace traditional methods for eye tracking and paw tracking, and improve on tracking of other facial movements including licking, with a single side view video (to be used in IBL behavior rigs). Using the three-camera setup (to be used in IBL ephys rigs), we routinely tracked pupil, paws, tongue and nose bilaterally with a test error of less than 5 pixels (1 px corresponding to approximately 0.05 mm; image dimensions 1280 x 1024 px); the training set consisted of 400 images sampled uniformly from two side videos. For future video tracking, an extended training set will be created, including images from different mice and rare outlier behavioral events. Furthermore, tracking was highly robust to mp4 compression (h.264). Combining detailed tracking of body movements with physiological recordings will be a powerful tool for investigating the neural underpinnings of behavior. In the following, we describe the hardware and software for video tracking.
Every training rig will be equipped with one Chameleon3 camera with a 16 mm fixed focal length lens (Figure 1a). The camera will capture a single side view at 30 Hz (Figure 1b) for tracking the pupil, at least one paw and the tongue (the water tube should not be too close to the mouse’s mouth, so that the tongue remains clearly visible). Imaging will be through an IR longpass filter (730 nm). Lighting is provided by an array of LEDs (850 nm), diffused by an opaque plastic sheet.
Figure 1. a) Positioning of the camera in the training rig. b) Example video of the camera.
Integration with the Bpod will be achieved by connecting the pins of the camera to an ethernet port of the Bpod. The camera will output a 3 V TTL pulse during every frame capture which will be logged and timestamped by the Bpod. The jitter in frame acquisition is smaller than 1 ms (Figure 2).
Figure 2. a) Histograms of the inter-frame interval of a three-minute video from two cameras, one running at 60Hz (blue) and the other at 150Hz (orange). There is a certain variation in the latency between successive video frames. b) a histogram of frame-to-frame latency of the 60 Hz (blue) and 150 Hz (orange) cameras. Jitter is defined as the standard deviation of the frame-to-frame latency variation. 60 Hz camera jitter: 195.6 μs; 150 Hz camera jitter: 143.2 μs.
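The jitter definition in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that the frame TTL timestamps are available as a list of times in seconds; the function name and the synthetic timestamps are hypothetical and are not taken from the IBL code or from measured data.

```python
import statistics


def frame_jitter_us(timestamps_s):
    """Return (mean inter-frame interval, jitter), both in microseconds.

    Jitter is taken as the standard deviation of the inter-frame
    intervals, following the definition in the Figure 2 caption.
    """
    if len(timestamps_s) < 3:
        raise ValueError("need at least three timestamps")
    intervals = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    mean_us = statistics.mean(intervals) * 1e6
    jitter_us = statistics.pstdev(intervals) * 1e6
    return mean_us, jitter_us


# Synthetic example (made-up numbers): a nominal 60 Hz camera in which
# every odd frame is timestamped 100 us late.
period = 1 / 60
timestamps = [i * period + (1e-4 if i % 2 else 0.0) for i in range(601)]
mean_us, jitter_us = frame_jitter_us(timestamps)
print(f"mean interval: {mean_us:.1f} us, jitter: {jitter_us:.1f} us")
```

In this synthetic case the intervals alternate between 100 μs too long and 100 μs too short, so the reported jitter is 100 μs around a mean interval of one nominal frame period.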
Standardizing the camera image across rigs is an extremely important aspect of video capture within the IBL. We therefore wrote a program that allows users to set up and position each camera in a standardized way. In the program, the user is presented with an outline of an empty training rig, overlaid on the live video feed of the camera. The user then adjusts the camera angle by hand until the outline and the video image match (Figure 3a,b). To ensure that the video image is as similar as possible between cameras, all software camera settings are set to fixed values (frame rate, gain, shutter time, etc.). Two variables, however, have to be set physically on the lens of the camera: the focus and the aperture. To set the focus, the user is instructed to place a mouse in the setup and adjust the lens until the eye of the mouse is in focus. The aperture is set using a window in the program that shows all saturated pixels in the live video feed (Figure 3c) and a histogram of pixel intensities (Figure 3d). The user is instructed to open the aperture until slight saturation starts to appear on parts of the mouse.
Figure 3. a) The camera image of an empty training rig overlaid with a drawing of another rig. The image and the drawing do not match. b) After correct positioning of the camera, the line drawing and the video feed match. c) The aperture of the lens is slowly opened until slight saturation (red pixels) appears on the mouse.
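The saturation view and intensity histogram used for setting the aperture can be sketched as follows. This is only an illustration, assuming 8-bit grayscale frames given as nested lists; the function name, the bin count and the toy frame are hypothetical and do not describe the actual setup program.

```python
def saturation_report(image, max_value=255, n_bins=8):
    """Return (fraction of saturated pixels, coarse intensity histogram).

    Assumes an 8-bit grayscale image given as a list of rows.
    """
    pixels = [p for row in image for p in row]
    saturated = sum(1 for p in pixels if p >= max_value)
    bin_width = (max_value + 1) / n_bins
    histogram = [0] * n_bins
    for p in pixels:
        histogram[min(int(p / bin_width), n_bins - 1)] += 1
    return saturated / len(pixels), histogram


# Toy 3 x 4 frame with two saturated pixels (made-up values).
frame = [
    [10, 40, 90, 255],
    [20, 60, 200, 255],
    [30, 80, 120, 250],
]
fraction, histogram = saturation_report(frame)
print(f"saturated fraction: {fraction:.3f}, histogram: {histogram}")
```

A user following the procedure above would open the aperture until the saturated fraction becomes slightly larger than zero on the mouse.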
The recording rig will be equipped with three cameras: two side cameras and one body camera (Figure 4). One of the side cameras will run at 150 Hz with a resolution of 640x512 px, the other side camera at 60 Hz (full camera resolution, 1280x1024 px) and the body camera at 30 Hz (1280x1024 px). The cameras will be recorded by a dedicated video capture computer. The 150 Hz camera will send TTL pulses at every frame capture to both the Bpod and the electrophysiology acquisition FPGA. The other two cameras will send a TTL pulse only to the FPGA (Figure 4f). The Bpod sends a synchronization pulse to the FPGA at each trial onset. The 150 Hz camera will serve as a secondary synchronization source between the FPGA and the Bpod.
Figure 4. a) Camera setup of the recording rig. Note that this is only an approximation of what the recording rig will look like, since it is not finalized yet. b) Two cameras track both sides of the mouse (camera 1, 2). One camera records movements of the trunk and tail of the mouse (camera 3). Each camera has one IR-light source. c) Example video recorded with camera 1. d) Example video recorded with camera 2. e) Example video recorded with camera 3. f) Wiring diagram of video with ephys setup.
The training rig camera will be recorded at 30 Hz on the Windows computer that runs the behavioral task. The recording software is Bonsai, which allows a video stream window to be displayed to the user during behavioral training. For the recording rig, Bonsai will also run on Windows, with one side view camera at 60 Hz, the mouse trunk camera at 30 Hz, and the other side camera at 150 Hz (and half the spatial resolution).
We next describe the tracking of body parts using machine learning and how tracking performance depends on video compression. For a performance test, DeepLabCut (DLC, https://github.com/AlexEMG/DeepLabCut) was used to predict the location of 27 features seen from at least one side (note that the points we track routinely for the IBL task differ, as shown below). The points include 4 for tongue tracking (2 static points on the lick port, 2 points at the edge of the tongue), 4 for pupil tracking for each pupil and the tips of 4 fingers per paw. In addition, the nose tip, chin, whisker pad and ear tip were tracked (Figure 5). The 27 features are as follows, with “r” or “l” indicating the right or left instance of the feature, as seen from the right/left camera:
'pupil_top_r', 'pupil_right_r', 'pupil_bottom_r', 'pupil_left_r', 'whisker_pad_r', 'nose_tip', 'tube_top', 'tongue_end_r', 'tube_bottom', 'tongue_end_l', 'chin_r', 'pinky_r', 'ring_finger_r', 'middle_finger_r', 'pointer_finger_r', 'ear_tip_r', 'pupil_top_l', 'pupil_right_l', 'pupil_bottom_l', 'pupil_left_l', 'whisker_pad_l', 'chin_l', 'pinky_l', 'ring_finger_l', 'middle_finger_l', 'pointer_finger_l', 'ear_tip_l'
In order to create a training set for the DLC network, the positions of these features were manually annotated in 400 images, some of which were held out as “ground truth” for testing the prediction accuracy of the network. Note that some features are seen from both side views. Of the 400 images, 300 were each uniformly sampled from one of the two side videos (both 10 min long, at 60 Hz, compressed at crf 29, 1280 x 1024 px) and 100 were manually selected to provide sufficient samples that display the mouse’s tongue. Thus, a single DLC network was trained jointly on images from both side views. The labeling was done using the DLC GUI.
For cross-validation, the manually labeled images were split randomly into 280 training and 120 test images, and this was repeated 4 times with different random splits, in order to evaluate the prediction precision of the DLC network, as suggested by Mathis et al.5. The error was computed as the Euclidean distance in pixels from “ground truth” to the position predicted by DLC. Figure 5 shows the test error across 120 test images for each tracked feature. Note that these results can easily be improved by including more hand-selected training images (we plan to generate such a training set from videos of several labs and animals, for generality). The high accuracy of the position prediction for the 4 pupil points (test errors below 2 px = 0.1 mm) allows accurate estimation of the pupil diameter and position, thereby replacing dedicated eye-tracking software and obviating the need for an extra eye camera. Tracking 4 fingers per paw allows the reliable distinction and tracking of each paw.
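The evaluation procedure described above (random 280/120 splits and a Euclidean error in pixels) can be sketched in a few lines. This is a minimal illustration, not the DLC evaluation code: the function names, seeds and example coordinates are hypothetical, and the pixel-to-millimetre factor is the approximate 0.05 mm per pixel quoted earlier.

```python
import math
import random

PX_TO_MM = 0.05  # approximate scale quoted in the text


def random_split(n_images, n_train, seed):
    """Return (train_indices, test_indices) for one random split."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def pixel_error(truth_xy, predicted_xy):
    """Euclidean distance in pixels between annotation and prediction."""
    return math.hypot(truth_xy[0] - predicted_xy[0],
                      truth_xy[1] - predicted_xy[1])


# Four different random splits of 400 labeled images into 280 / 120.
splits = [random_split(400, 280, seed) for seed in range(4)]

# Made-up annotation and prediction for a single feature.
err_px = pixel_error((100.0, 200.0), (103.0, 204.0))
err_mm = err_px * PX_TO_MM
print(f"error: {err_px:.1f} px = {err_mm:.2f} mm")
```

In practice a network would be trained on each training split and the error averaged per feature over the corresponding test split.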
Accurate lick detection with DLC can be achieved with two points on the tongue tip and two points on the water port: a lick is detected when the two tube points are within a small distance of one of the tongue tip points. The video of the mouse’s trunk can be used for analysis of movement using unsupervised methods2-5.
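One possible reading of this lick criterion is sketched below: a frame counts as a lick when at least one tongue point lies within a threshold distance of both tube points. The function name, the 20 px threshold and the coordinates are assumptions for illustration only; the source does not specify a threshold value.

```python
import math


def is_lick(tongue_points, tube_points, max_dist_px=20.0):
    """True if some tongue point is within max_dist_px of both tube points.

    max_dist_px is a hypothetical threshold, to be tuned per setup.
    """
    return any(
        all(math.hypot(tx - ux, ty - uy) <= max_dist_px
            for ux, uy in tube_points)
        for tx, ty in tongue_points
    )


# Made-up coordinates: tube_top and tube_bottom, then two frames.
tube = [(500.0, 600.0), (500.0, 620.0)]
licking = is_lick([(495.0, 610.0), (470.0, 615.0)], tube)
not_licking = is_lick([(440.0, 610.0), (430.0, 615.0)], tube)
print(licking, not_licking)
```

A practical detector would presumably also ignore frames in which the tongue points are predicted with low confidence, since the tongue is hidden for most of the video.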
Figure 5. Performance test of feature tracking with DeepLabCut. a) Example frames from the left side camera with tracked features shown. b) The 27 features allow robust tracking of tongue, pupils and paws. c) For each of the 27 features, the training and test error of DLC is shown (each side video was at 1280 x 1024 px). This already high performance can be further improved with a larger training set.
All points were predicted with less than 5 px error on average across test images, i.e. comparing the manually annotated feature position with the DLC network prediction (Figure 5c). In addition to the automatic evaluation of the test images, the labeled 10 min side view videos were visually inspected, supporting the conclusion that tracking of tongue, paws and pupils is robust. We further found that, despite frequent whisker occlusion of the pupil, fewer than 4% of the frames had more than 2 pupil points occluded. Two of the four points around the pupil are enough to reliably estimate pupil position and dilation.
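How two visible points can suffice is illustrated by the following sketch. It assumes a circular pupil with the four tracked points at its top, bottom, left and right extremes, which is a simplification (an obliquely viewed pupil is closer to an ellipse); the function name and coordinates are hypothetical.

```python
import math


def estimate_pupil(points):
    """Estimate ((cx, cy), diameter_px) from visible pupil edge points.

    points maps any of 'top', 'bottom', 'left', 'right' to (x, y).
    Assumes a circular pupil with points at its cardinal extremes.
    """
    diameters, centers = [], []
    # Opposite pairs give the diameter and the center directly.
    for a, b in (("top", "bottom"), ("left", "right")):
        if a in points and b in points:
            (ax, ay), (bx, by) = points[a], points[b]
            diameters.append(math.hypot(ax - bx, ay - by))
            centers.append(((ax + bx) / 2, (ay + by) / 2))
    if not diameters:
        # Only adjacent points visible: their chord has length r * sqrt(2).
        vertical = [points[k] for k in ("top", "bottom") if k in points]
        horizontal = [points[k] for k in ("left", "right") if k in points]
        if not (vertical and horizontal):
            raise ValueError("need at least two visible pupil points")
        (vx, vy), (hx, hy) = vertical[0], horizontal[0]
        diameters.append(math.sqrt(2) * math.hypot(vx - hx, vy - hy))
        centers.append((vx, hy))
    cx = sum(c[0] for c in centers) / len(centers)
    cy = sum(c[1] for c in centers) / len(centers)
    return (cx, cy), sum(diameters) / len(diameters)


# Made-up pupil: center (50, 40), radius 10 px.
full = {"top": (50.0, 30.0), "bottom": (50.0, 50.0),
        "left": (40.0, 40.0), "right": (60.0, 40.0)}
center_all, diameter_all = estimate_pupil(full)
partial = {"top": full["top"], "right": full["right"]}
center_two, diameter_two = estimate_pupil(partial)
print(center_all, diameter_all, center_two, diameter_two)
```

Under the circular assumption both calls recover the same center and a diameter of about 20 px, whether all four points or only two adjacent points are visible.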
Further, we evaluated the tracking performance for a single side view camera only, concluding that pupil, tongue and paws can be tracked reliably, using the network trained on both views. However, one paw is often occluded in one side view and one pupil is never visible, hence the need for two side views for the ephys recording rig.
Once a DLC network is trained, the computation time for the detection of a point for a given image increases non-linearly with the spatial resolution of the image. For this reason and the increased ease of creating a hand-labelled dataset when using cropped videos, we devised a method to automatically crop regions of interest for the side view cameras before applying specialised DLC networks to those cropped videos. (For the recording rigs, the original videos are kept at low compression (crf 17) for other potential analyses.)
First, a DLC network is trained to detect anchor points in the side view videos. These anchor points (top of drinking spout, tip of nose and pupil) are then used to position the rectangular regions of interest shown in Figure 6 (for both side views): a 100x100 px region around the eye, a 100x100 px region around a nostril, a 100x100 px region around the tongue and a 900x750 px region around the paws, which is subsequently spatially downsampled to 450x374 px prior to applying a DLC network.
We automatically track the following points (note that these are slightly different to those used for the accuracy test above):
'pupil_top_r', 'pupil_right_r', 'pupil_bottom_r', 'pupil_left_r', 'tube_top', 'tongue_end_r', 'tube_bottom', 'tongue_end_l', 'pinky_r', 'ring_finger_r', 'middle_finger_r', 'pointer_finger_r', 'pupil_top_l', 'pupil_right_l', 'pupil_bottom_l', 'pupil_left_l', 'chin_l', 'pinky_l', 'ring_finger_l', 'middle_finger_l', 'pointer_finger_l', 'nostril_edge_low', 'nostril_edge_high'
Figure 6. Autocrop procedure as applied to each of the side view videos. A general DLC network is used to find the pupil, nose tip and top of drinking spout in several images in order to find their average position. These locations are then used to crop rectangular regions of interest. Specialized DLC networks are then applied to find points in cropped videos.
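The cropping step can be sketched as follows. This is a minimal illustration that assumes each region is centered on the average anchor position and shifted where needed to stay inside the 1280 x 1024 frame; the source does not state how regions are offset from their anchors (in particular for the paw region), so the centering, the function names and the example detections are assumptions.

```python
def mean_position(points):
    """Average (x, y) of an anchor point detected in several images."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def crop_box(center_xy, width, height, frame_w=1280, frame_h=1024):
    """Return (left, top, right, bottom) of a width x height box.

    The box is centered on center_xy and shifted if necessary so that
    it stays inside the frame.
    """
    cx, cy = center_xy
    left = int(round(cx - width / 2))
    top = int(round(cy - height / 2))
    left = max(0, min(left, frame_w - width))
    top = max(0, min(top, frame_h - height))
    return left, top, left + width, top + height


# Made-up pupil detections from three frames of one side video.
pupil_detections = [(400.0, 300.0), (402.0, 298.0), (398.0, 302.0)]
eye_box = crop_box(mean_position(pupil_detections), 100, 100)

# An anchor near the image corner is shifted back into the frame.
corner_box = crop_box((1270.0, 10.0), 100, 100)
print(eye_box, corner_box)
```

Because the box is fixed per video rather than per frame, the cropped videos keep a constant size, which is what a specialized network applied to them would need.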
Independently of, and complementary to, the recently published mp4 compression test of DLC performance by Mathis et al.6, we trained DLC to track three facial points (pupil center, tongue, and nose) on h.264-compressed videos at different compression ratios but with identical training frames, using ffmpeg for compression. For example, in a terminal (command line):
ffmpeg -i input.avi -c:v libx264 -crf 29 output.mp4
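For a series of compression levels, the same command can be generated from a script. The sketch below only builds the argument lists for the crf values mentioned in this document; the output file names and function names are hypothetical, and the commented-out call would require ffmpeg to be installed.

```python
import subprocess


def h264_command(src, dst, crf):
    """Build the ffmpeg argument list for one h.264 compression level."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]


def compress(src, dst, crf):
    """Run ffmpeg; raises CalledProcessError if the encoding fails."""
    subprocess.run(h264_command(src, dst, crf), check=True)


# crf levels mentioned in the text; output names are placeholders.
commands = [
    h264_command("input.avi", f"output_crf{crf}.mp4", crf)
    for crf in (17, 29, 35, 47)
]
for cmd in commands:
    print(" ".join(cmd))
# compress("input.avi", "output_crf29.mp4", 29)  # uncomment to encode
```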
We found that DLC’s performance is not obviously affected even by highly aggressive compression (mp4, crf=47) (Figure 7). However, crf=47 images are blurry and distorted. We therefore consider crf=29 (smaller crf values correspond to higher image quality and lower compression ratios) to be the highest compression level at which these points can still be annotated manually with accuracy. It is possible to train DLC on weakly compressed videos and then apply it to strongly compressed videos. We propose that all recording rig videos are stored at crf=17 for potential future analyses. In case of server storage limitations, videos should be compressed at crf=35, with a small fraction of trials (e.g. every 25th trial) additionally stored at crf=29 for the creation of training sets. This will lead to an overall compression ratio of 100x, corresponding to approximately 330 MB / hour on the recording rig (Table 1).
Figure 7. DLC prediction accuracy as a function of h.264 compression. For each compression level (x-axis) the corresponding file-size is shown (second x-axis) as well as the test and train error of DLC tracking 3 points. The same 200 hand-annotated images were extracted from the video after each compression, using the same position annotations throughout. Then 100 images were used for training a network freshly, starting with default weights, and tested on the other 100 images. This procedure was repeated 4 times for different random splits of the 200 images into 100 train and 100 test images, each time training a network freshly from default weights.
Table 1. File size overview
1 h at 150 Hz
1 h at 60 Hz
Behavior Rig 1 h (1x30 Hz)
Ephys Rig 1h (2x60 Hz + 1x150 Hz)
Table 1 illustrates the file size for a 1 h side video for different sample rates and mp4 compression levels. Note that, to a good approximation, these file sizes scale linearly with image dimensions and sample rate. Further, increasing crf by 6 reduces the file size roughly by half. Compression via h.264 (mp4) takes changes across frames into account, i.e. areas that do not change much across frames are more strongly averaged over (smeared out) than those where much change happens.
We used the following video acquisition PC and cameras:
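These two scaling rules can be combined into a rough size estimate. The sketch below is only an approximation under the stated rules (linear in pixel rate, halving for every 6 crf steps); the function name is hypothetical and the reference size of 1000 MB is a placeholder, not a value from Table 1.

```python
def scaled_size_mb(ref_size_mb, ref, target):
    """Scale a reference file size to other video settings.

    ref and target are dicts with keys width, height, fps and crf.
    Size is assumed linear in pixels per second and to halve for
    every increase of crf by 6.
    """
    pixel_rate = (target["width"] * target["height"] * target["fps"]) / (
        ref["width"] * ref["height"] * ref["fps"])
    crf_factor = 0.5 ** ((target["crf"] - ref["crf"]) / 6)
    return ref_size_mb * pixel_rate * crf_factor


REF_SIZE_MB = 1000.0  # placeholder, not a measured value
ref = {"width": 1280, "height": 1024, "fps": 60, "crf": 29}

# Same video at crf 35: roughly half the size.
same_video_crf35 = scaled_size_mb(REF_SIZE_MB, ref, {**ref, "crf": 35})

# 150 Hz camera at half the spatial resolution, same crf.
fast_camera = scaled_size_mb(
    REF_SIZE_MB, ref, {"width": 640, "height": 512, "fps": 150, "crf": 29})
print(same_video_crf35, fast_camera)
```

Under these rules, the 150 Hz camera at 640 x 512 px produces about 0.625 times the data of a 60 Hz full-resolution camera at the same crf, because the fourfold reduction in pixels outweighs the 2.5-fold increase in frame rate.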
High speed camera
IR pass filter
Acrylic sheet for diffusing IR illumination
160 mm x 200 mm x 3mm
IR reflective film
PR70 (170 mm x 210 mm)
PC for video acquisition
Motherboard: MSI X470 GAMING PLUS (sku: 4719072564278)
RAM: Crucial Kit 32GB (4 x 8GB) DDR4 2666MHz Ballistix Tactical (sku: BLT4C8G4D26AFTA)
CPU: AMD RYZEN 5 1600 3.2GHz - 3.6GHz 16MB AM4 (sku: YD1600BBAEBOX)
Graphics card: ASUS GeForce GTX 1050 TI DC2 OC 4GB GD5 (sku: 90YV0A32-M0NA00)
SSD (data): Samsung 960 Pro 2TB M.2 NVME (sku: MZ-V6P2T0BW)
Power supply: Corsair Modular CS-750W 80 PLUS GOLD (sku: CP-9020078-EU)
SSD (OS): Samsung 850 Evo 250GB SATAIII (sku: MZ-75E250B/EU)
Case: Fractal Design Define R5 Black Mid Tower Case (sku: FD-CA-DEF-R5-BK)
OS: Ubuntu 16.04