Mid-term Report

Group members: Haochen Shi, Chenhao Lu, Rui Pan

{hshi74, clu92, rpan33}@wisc.edu


In our project proposal, we suggested three vision tasks that we aimed to accomplish with our RelaxedIK framework, in order of increasing difficulty: taking a stable panoramic picture, detecting objects in real time, and tracking moving objects. After starting our implementation and re-evaluating the challenges of these tasks, we are confident that it is feasible to complete all of them by the end of the semester. In this mid-term report, we present our preliminary results for the first two tasks.

Taking a Panorama

A real robot is not accessible to us due to the COVID-19 pandemic, so we decided to carry out our implementation and experiments in a simulated environment, as shown in Fig. 1. For this task, we tested a 7-DOF Rethink Robotics Sawyer robot in a scene simulated in CoppeliaSim [1].


Fig. 1. A screenshot of the simulated scene in CoppeliaSim

We used two control methods to test our implementation: (1) interactive control, in which a user drives the Sawyer robot with an Xbox game controller (also known as a joypad), and (2) automatic control, in which the Sawyer robot follows a predefined motion path in Cartesian space.

In both methods, we controlled the Sawyer robot to rotate 150 degrees around its wrist (link 7) while keeping its position fixed. During the rotation, the camera attached to the robot's end effector captured images at an interval of 0.3 s. We then combined the images into a panorama with the OpenCV stitcher [7]. The images before and after stitching are shown in Fig. 2 and Fig. 3.
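The automatic sweep above can be sketched as a list of wrist-joint angle targets sent to the robot one per capture interval. This is an illustrative sketch, not our actual controller code: the function name and the 5-degree step size are assumptions, and sending each target to CoppeliaSim is out of scope here.

```python
import math

def wrist_rotation_waypoints(total_deg=150.0, step_deg=5.0):
    """Generate wrist-joint angle targets (in radians) for a sweep of
    `total_deg` degrees in `step_deg` increments. Joints 1-6 stay fixed;
    only the wrist (link 7) rotates."""
    n_steps = int(total_deg / step_deg)
    return [math.radians(i * step_deg) for i in range(n_steps + 1)]

# With a 0.3 s capture interval and one waypoint every 5 degrees, the
# sweep takes (150 / 5) * 0.3 = 9 seconds and yields 31 frames.
waypoints = wrist_rotation_waypoints()
```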


Fig. 2. Images captured by the robot's camera


Fig. 3. The panorama after stitching

We also conducted a small user study with a sample size of 3. Each participant had some experience with game controllers but no prior experience manipulating a robot. We asked them to control the Sawyer robot to take a panorama using control method (1). To test how intuitive our interactive control method is, we provided no instructions (such as what each button on the game controller does). On average, it took them about 5 minutes to complete the task, and their feedback suggests that our method is relatively easy to use.

Object Detection

In this part, we added support for real-time object detection to our RelaxedIK framework.


Fig. 4. An example of a segmentation mask in the database

We loaded a Mask R-CNN [6] model with a ResNet-50 [5] backbone (maskrcnn_resnet50_fpn) from the torchvision module. The model was pre-trained on the COCO dataset [4] and had an initial size of 178 MB. We fine-tuned the model for human body recognition on the Penn-Fudan database [3], which contains 170 images with 345 instances of pedestrians; each image has a corresponding segmentation mask (as shown in Fig. 4). Note that we will eventually need to detect other types of objects in the environment (e.g., desks, chairs, plants), but as a sanity check, we focused on human body recognition in this demo. For the PyTorch optimizer, we used stochastic gradient descent (torch.optim.SGD) with a learning rate of 0.005 and a momentum of 0.9. We also used a learning rate scheduler that decreases the learning rate by a factor of 10 every 3 epochs for faster and more stable training. We trained for 10 epochs and ran an evaluation after every epoch. The trained model has a size of 351 MB.

After the training phase, we wrapped the model in a Predictor class so that a single call to its prediction function invokes the model's forward pass. A sample input and output are shown in Fig. 5.


Fig. 5. Real-time input and output of the camera attached to the robot

In our experiment, each transformation (from a raw image to a segmentation mask) takes about 2 s, which leads to high latency. To achieve our goal of smooth real-time object detection, we may need to look into low-latency real-time object detection implementations such as the one described in the tutorial [8]. Our goal is to detect objects at a rate of at least 100 Hz (i.e., within 10 ms per frame).
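Independently of making the model itself faster, one common way to keep the control loop responsive (not necessarily the approach taken in [8]) is to decouple capture from inference: run the detector in a background thread on the most recent frame only, dropping frames that arrive while it is busy, so the robot always reads the latest available mask without blocking. The sketch below uses a hypothetical AsyncDetector class of our own design.

```python
import threading
import time

class AsyncDetector:
    """Runs `infer(frame)` in a background thread on the most recent
    frame only, so a fast control loop never blocks on the slow model.
    Frames that arrive while inference is busy are simply dropped."""

    def __init__(self, infer):
        self._infer = infer
        self._lock = threading.Lock()
        self._latest_frame = None
        self._latest_result = None
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def submit(self, frame):
        # Called by the camera loop; non-blocking, overwrites older frames.
        with self._lock:
            self._latest_frame = frame

    def latest_result(self):
        # Called by the control loop; non-blocking, may lag the camera.
        with self._lock:
            return self._latest_result

    def stop(self):
        self._running = False
        self._thread.join()

    def _loop(self):
        while self._running:
            with self._lock:
                frame, self._latest_frame = self._latest_frame, None
            if frame is None:
                time.sleep(0.001)
                continue
            result = self._infer(frame)  # the slow part (~2 s in our tests)
            with self._lock:
                self._latest_result = result
```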

Future Improvements

  • Right now our model only supports human body recognition. In the second half of this project, we will try to add support for other types of objects in the environment.

  • So far we have only tested our implementation in a simulated scene with a simple indoor setup. Although we are constrained to work in simulations due to the pandemic, we aim to incorporate more sophisticated scenes that resemble real-life scenarios (e.g., factories, more complicated indoor scenes, etc.) into our testbed.