Walabot is a kind of FMCW (Frequency Modulated Continuous Wave) radar. It senses the environment by transmitting, receiving, and recording signals from MIMO antennas. Its frequency range is 3.3-10 GHz. Today I am going to show you how Walabot collects 3D images.
Our radar sensing platform emits probing pulse signals x(t) at a pulse repetition frequency (PRF) of 16 Hz. Within each pulse repetition interval (PRI), the receiver antenna samples the received signal y(t) at a much higher frequency of 8 kHz. In the raw-signal plot, the x-axis is response time and the y-axis is the amplitude at that time slot. The red line shows the samples taken by the receiver antenna. In other words, the response time is the signal's travel time from the transmit antenna to the receive antenna, which depends on the distance to the radar. A higher amplitude at a specific time means there is an object at the corresponding distance.
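To make the two rates concrete, here is a minimal sketch of the sample timing within one PRI. It uses only the PRF and sampling frequency quoted above. The variable names are my own and are not part of the Walabot API.

```python
# Values quoted in the text above.
PRF_HZ = 16              # pulse repetition frequency
SAMPLE_RATE_HZ = 8_000   # receiver sampling frequency within each PRI

# Length of one pulse repetition interval, in seconds.
pri_s = 1 / PRF_HZ

# Number of receiver samples that fit in one PRI.
samples_per_pri = SAMPLE_RATE_HZ // PRF_HZ

# Response time (x-axis of the raw signal) of each sample, in seconds.
sample_times_s = [n / SAMPLE_RATE_HZ for n in range(samples_per_pri)]

print(pri_s, samples_per_pri)
```

Each entry of sample_times_s is one time slot on the x-axis, and the recorded amplitude at that slot is the y value.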
A 2D image only shows φ (wide angle) versus R. However, any object in the real world is 3D and has height as well, so we introduce θ (elevation angle) to represent height. To get a 3D image, let's look at the coordinate system in Walabot.
X = R * sin(θ)
Y = R * cos(θ) * sin(φ)
Z = R * cos(θ) * cos(φ)
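The three formulas above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and I assume the angles are given in degrees and R in centimeters, as in the axis ranges used later in this post.

```python
import math


def spherical_to_cartesian(r, theta_deg, phi_deg):
    """Convert (R, theta, phi) to (X, Y, Z) using the formulas above.

    Assumption: angles are in degrees; the result has the same unit as r.
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    x = r * math.sin(theta)
    y = r * math.cos(theta) * math.sin(phi)
    z = r * math.cos(theta) * math.cos(phi)
    return x, y, z


# A point straight ahead of the sensor lies on the Z axis.
print(spherical_to_cartesian(100, 0, 0))
```

Whatever the angles are, the distance of the resulting point from the origin stays equal to R, which is a quick sanity check on the conversion.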
Instead of using the 2D images directly, we construct 3D images from them by stacking them in the vertical direction.
The figure above shows how to stack the 2D images. The implementation steps are:
- Concatenate the 2D images into a 3D matrix
- Use measure.marching_cubes_classic to make vertices and faces
- Change the axis ranges from index intervals to real units:
  (0,100) => (1,200) R (cm)
  (0,61) => (-90,90) φ (degrees)
  (0,9) => (-20,20) θ (degrees)
- Use Poly3DCollection to get the mesh
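The axis change in the steps above is a linear rescaling of each vertex coordinate. Below is a minimal pure-Python sketch of it. The helper names are hypothetical, and I assume each vertex comes out ordered as (R index, φ index, θ index); the real order depends on how the 2D images were concatenated.

```python
def rescale(index, index_range, unit_range):
    """Linearly map an index in index_range to a value in unit_range."""
    i0, i1 = index_range
    u0, u1 = unit_range
    return u0 + (index - i0) * (u1 - u0) / (i1 - i0)


# (index range, real-unit range) per axis, copied from the list above.
AXES = {
    "R": ((0, 100), (1, 200)),      # cm
    "phi": ((0, 61), (-90, 90)),    # degrees
    "theta": ((0, 9), (-20, 20)),   # degrees
}

# Assumed vertex layout: (R index, phi index, theta index).
ORDER = ("R", "phi", "theta")


def vertices_to_units(vertices):
    """Convert index-space vertices to (cm, degrees, degrees)."""
    return [
        tuple(rescale(v, *AXES[name]) for v, name in zip(vertex, ORDER))
        for vertex in vertices
    ]


print(vertices_to_units([(0, 0, 0), (100, 61, 9)]))
```

The rescaled vertices, together with the unchanged faces, are what would then be passed on to build the mesh.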
Once we have the 3D image, we save it to an in-memory IO buffer, open the buffer with PIL, convert it to an ndarray, and write the ndarray to the video as one frame using OpenCV.
The video above shows a person walking around the Walabot radar.
CNN Feature Extraction
For each frame, we use ResNet-18 to extract features. We change the last average pooling layer to max pooling, because max pooling keeps the strongest responses, such as edges, whereas average pooling smooths the features out. Both layers serve the same purpose, but I think max pooling is better at extracting extreme features. Average pooling takes every value into account, so the resulting average may or may not be important for object-detection-type tasks. Finally, we change the output linear layer so that it produces 10 features.
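To illustrate the difference on a toy example, the sketch below applies global max pooling and global average pooling to a single small feature map with one strong activation. The values are made up for illustration and are not taken from our network.

```python
def global_max_pool(feature_map):
    """Return the largest activation in a 2D feature map."""
    return max(max(row) for row in feature_map)


def global_avg_pool(feature_map):
    """Return the mean activation of a 2D feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# Made-up 4x4 feature map: one strong response, e.g. at an edge.
feature_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

print(global_max_pool(feature_map))  # strongest response is kept
print(global_avg_pool(feature_map))  # response is diluted by the zeros
```

Max pooling passes the single strong response through unchanged, while average pooling spreads it over all 16 positions.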
We use 3 frames to recognize one activity, because the 3D radar signal shown above is an abstract video. Unlike camera video, in which a single frame can represent an activity, a radar video only reveals an activity through the changes across consecutive frames. We collect many 3-frame clips of each activity as training data. At test time, given a continuous stream of radar video, we feed every 3 frames to an LSTM network, which gives a prediction for each frame. The detailed technique will be presented in my next paper.
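One way to cut a continuous stream into 3-frame inputs is a sliding window, sketched below. This is an assumption on my part: the windows here overlap and advance by one frame, and the function name is hypothetical. In practice each window would hold the per-frame CNN features rather than the plain integers used here.

```python
from collections import deque


def sliding_windows(frames, window=3):
    """Yield every run of `window` consecutive frames from a stream."""
    buffer = deque(maxlen=window)  # oldest frame drops out automatically
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window:
            yield tuple(buffer)


# Integers stand in for frames; each tuple is one input sequence.
for clip in sliding_windows(range(5)):
    print(clip)
```

With this scheme, a prediction becomes available for every frame from the third one onward.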