CS 639 @ UW-Madison: Computer Vision (Fall 2020) Course Project
Haochen Shi, Rui Pan, Chenhao Lu
The idea for this project came from one of our team members, who has been working in our school's robotics lab. We wanted to explore the gap between computer vision algorithms and real-world robotics applications. It is usually hard to apply a theory to real-world applications, so we were curious how hard it is in the case of computer vision and robotics. After studying the literature at the intersection of vision and robotics, we found that vision has been widely used on robots such as drones, autonomous cars, and social robots, but there has been much less interest in integrating vision algorithms with high-degree-of-freedom robot arms. Given that these robot arms have a wide range of capabilities, such as spoon feeding, garbage sorting, and heart surgery, we feel it is worthwhile to investigate how well vision plugins work on this kind of robot.
We did not have access to a real robot because of the COVID-19 pandemic, so all the implementations and experiments were carried out in a simulated environment in CoppeliaSim. The simulator provides a mature physics engine that simulates gravity and friction realistically.
We used the Rethink Robotics Sawyer, a 7-DOF robot arm.
We used RelaxedIK, a motion planning platform, to calculate the motion of the robot arm. RelaxedIK maps a 6-DOF pose goal (position + rotation) to a robot configuration quickly and accurately.
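Conceptually, this step looks like the sketch below. The `relaxed_ik_solver.solve()` wrapper and `sim_robot.set_joint_angles()` call are hypothetical placeholders; the actual interface depends on how RelaxedIK and CoppeliaSim are wired together (e.g., through RelaxedIK's ROS topics).

```python
# Minimal sketch of the pose-goal -> joint-configuration step.
# `relaxed_ik_solver` and `sim_robot` are hypothetical wrappers, not real APIs.
import numpy as np

def move_end_effector(relaxed_ik_solver, sim_robot, position, quaternion):
    """Drive the arm's end effector toward a 6-DOF pose goal."""
    # position: (x, y, z) in meters; quaternion: (x, y, z, w) orientation
    joint_angles = relaxed_ik_solver.solve(np.asarray(position),
                                           np.asarray(quaternion))
    # Send the 7 joint angles of the Sawyer to the simulated robot.
    sim_robot.set_joint_angles(joint_angles)
    return joint_angles
```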
The CPU we used for the experiments was an AMD Ryzen 7 2700X (eight cores, 3.70 GHz), and we did not have a GPU on our local machine.
An overview of the pipeline
For stitching the images, we used the OpenCV built-in image stitcher, which is well-modularized and easy to use.
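A minimal example of the stitching step is shown below; the frame file names are placeholders for the images captured by the robot.

```python
import cv2

# Load the frames captured during the rotation (placeholder file names).
frames = [cv2.imread(f"frame_{i:03d}.png") for i in range(20)]

# OpenCV's high-level stitcher handles feature matching, homography
# estimation, warping, and blending internally.
stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, panorama = stitcher.stitch(frames)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.png", panorama)
else:
    print("Stitching failed with status", status)
```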
We programmed the robot arm to rotate autonomously at a fixed speed to capture a stable 180-degree panorama. During the rotation, the camera attached to the end effector of the Sawyer robot captures an image every 0.3 s. See the video below for a glimpse of how the robot arm rotates.
A montage of the frames captured by the robot arm
Panorama generated by the stitcher
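The capture loop itself is conceptually simple. The sketch below assumes hypothetical helpers (`start_rotation`, `rotation_done`, and `get_camera_image` are placeholders, not real CoppeliaSim calls) for commanding the sweep and reading the end-effector camera.

```python
import time

# Sketch of the capture loop, assuming hypothetical helpers:
#   start_rotation()    - command the arm to sweep 180 degrees at fixed speed
#   rotation_done()     - True once the sweep finishes
#   get_camera_image()  - read the end-effector camera from the simulator
frames = []
start_rotation()
while not rotation_done():
    frames.append(get_camera_image())
    time.sleep(0.3)  # capture an image every 0.3 s

# `frames` is then passed to the OpenCV stitcher shown above.
```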
We also asked some testers without prior experience to control the robot arm and take panoramas with an Xbox controller. To test how intuitive this control method is, we did not provide any instructions, such as the function of each button on the controller. It took them about 5 minutes to learn the controls and take a panorama as stable as the one above.
In terms of potential applications, when field robots explore unknown environments, taking a panorama of the surroundings is a useful way to gather information.
Our first attempt was based on PyTorch, torchvision, and the Mask R-CNN model. Mask R-CNN predicts bounding boxes, class scores, and segmentation masks for potential objects in an image. The model was pre-trained on the COCO dataset and had an initial size of 178 MB. We fine-tuned it for better human body recognition with the Penn-Fudan database, which contains 170 images with 345 instances of pedestrians, each image paired with a segmentation mask. Each frame captured by the camera is then processed by the model for object detection.
An overview of the COCO dataset
An example of a segmentation mask in the Penn-Fudan database
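For reference, inference with the torchvision Mask R-CNN looks roughly like the sketch below; the fine-tuning itself follows the standard torchvision detection tutorial, and `frame_rgb` (an RGB image from the simulated camera) plus the score threshold are assumptions here.

```python
import torch
import torchvision
from torchvision.transforms import functional as F

# Mask R-CNN pre-trained on COCO; its heads are later fine-tuned on Penn-Fudan.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect(frame_rgb, score_threshold=0.7):
    """Run Mask R-CNN on a single RGB frame and keep confident detections."""
    image = F.to_tensor(frame_rgb)          # HWC uint8 -> CHW float in [0, 1]
    with torch.no_grad():
        output = model([image])[0]          # boxes, labels, scores, masks
    keep = output["scores"] > score_threshold
    return output["boxes"][keep], output["labels"][keep], output["masks"][keep]
```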
With only a CPU, the latency for processing a single 50 KB image is around 4 seconds, which falls far short of our goal of real-time detection, where we expected millisecond-level turnaround time. So we switched to a new approach.
The first change we made was to switch to a new detection framework, Detectron2. It is a state-of-the-art object detection framework from Facebook Research, built on PyTorch, with a wider range of features and faster inference. The latency improved but was still high on CPUs, so we looked into solutions using GPUs.
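Switching frameworks mostly amounts to building a config and a `DefaultPredictor`. A minimal sketch using a COCO-pretrained Mask R-CNN from the Detectron2 model zoo is shown below; `frame_bgr` (a BGR image from the camera) is an assumption.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cuda"        # "cpu" when no GPU is available

predictor = DefaultPredictor(cfg)
outputs = predictor(frame_bgr)   # frame_bgr: BGR image from the camera
instances = outputs["instances"].to("cpu")
boxes, classes = instances.pred_boxes, instances.pred_classes
```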
We ran benchmark experiments on Microsoft Azure and found that the two modifications combined (switching to Detectron2 and moving to a GPU) resulted in a roughly 200x performance increase, with a final latency of 16 milliseconds, equivalent to about 62 frames per second. These benchmarks showed that our implementation is viable for real-time use. For comparison, the state-of-the-art real-time object detection method on a Nao robot has a turnaround time of about 0.1 s on a single GPU, which is close to our method; the remaining difference comes from image size, GPU model, and similar factors.
A comparison of the latency for processing a 50 KB image across different framework and hardware setups
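The latency numbers above are essentially wall-clock measurements around a single inference call, along the lines of the sketch below (the warm-up and run counts are illustrative choices).

```python
import time

def measure_latency(predictor, frame_bgr, n_runs=50, warmup=5):
    """Average per-frame inference latency in milliseconds."""
    for _ in range(warmup):          # warm-up runs (model load, CUDA init)
        predictor(frame_bgr)
    start = time.perf_counter()
    for _ in range(n_runs):
        predictor(frame_bgr)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_runs
```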
Since we could not install the robotics software on the Azure remote machine, we ran our implementation on CPU-only machines, which led to a latency of about 2 seconds. Above is a demo showing our implementation of the object detection functionality.
Since the meshes provided in the simulation software have relatively few polygons and our model was trained on real-world pictures, it fails to detect some objects in the experiment, such as the coarsely modeled human figure. Therefore, we expect this method to work better in a real-world setting than in the simulator.
See below for a demo.
We first used the mean shift algorithm implemented in OpenCV, a histogram-based template tracking algorithm that iteratively moves the window until convergence. The major problem with this approach is that the window size and rotation are fixed, so it is not robust to changes in the size and orientation of the object.
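Our mean shift tracker follows the standard OpenCV recipe: build a hue histogram of the target region once, then back-project it onto each new frame and let `cv2.meanShift` move the window. In the sketch below, the initial window coordinates and the `first_frame` / `frame` images are placeholders.

```python
import cv2
import numpy as np

# Initial tracking window around the target (placeholder coordinates).
x, y, w, h = 200, 150, 80, 160
track_window = (x, y, w, h)

# Hue histogram of the target region in the first frame.
roi = first_frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves less than 1 pixel.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

# For every new frame: back-project the histogram and shift the window.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
_, track_window = cv2.meanShift(back_proj, track_window, term_crit)
x, y, w, h = track_window
cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
```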
Therefore, we switched to the CamShift algorithm. It is based on mean shift, but also updates the size and orientation of the window upon convergence. This change successfully resolved the issue.
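With the same setup as above, switching to CamShift only changes the tracking call: `cv2.CamShift` returns a rotated rectangle whose size and angle adapt to the target, in addition to the updated window.

```python
# Same histogram and back-projection setup as above; only the call changes.
rot_rect, track_window = cv2.CamShift(back_proj, track_window, term_crit)

# Draw the rotated, rescaled box returned by CamShift.
pts = np.intp(cv2.boxPoints(rot_rect))
cv2.polylines(frame, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
```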
In the demo below, we used the Xbox controller to control the robot. The tracking window constantly changes size to fit the human figure, and as you can see, our method is robust to changes in orientation and scale.
Although we encountered many issues when setting up the simulation environment, connecting the simulated robot to different vision APIs, and achieving the desired performance with those APIs, we are very glad that we figured everything out! Here are our conclusions from this project: