Walabot Shows 3D Images

Walabot senses the environment by transmitting, receiving, and recording signals from its MIMO antennas. The frequency range is 3.3-10 GHz. Today I am going to show you how Walabot collects 3D images.
Raw signals
Our radar sensing platform emits probing pulse signals x(t) at a pulse repetition frequency (PRF) of 16 Hz, but within each pulse repetition interval (PRI), the receive antenna samples the received signal y(t) at a much higher frequency of 8 kHz. The x-axis of the raw signal is the response time, and the y-axis is the amplitude at that time slot. The red line shows the samples recorded by the receive antenna. In other words, the response time is the signal's travel time from the transmit antenna to the receive antenna, which depends on the distance to the radar. A higher amplitude at a specific time means there is an object at the corresponding place.
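As a rough sketch of how such a raw signal can be pulled from the device, the calls below follow the Walabot Python SDK's WalabotAPI wrapper; exact function names can differ between SDK versions, so treat them as assumptions rather than a definitive recipe:

```python
# Sketch: read one raw signal (amplitude vs. response time) from Walabot.
# Assumes the official WalabotAPI Python wrapper is installed.
import WalabotAPI as wlbt
import matplotlib.pyplot as plt

wlbt.Init()
wlbt.SetSettingsFolder()
wlbt.ConnectAny()
wlbt.SetProfile(wlbt.PROF_SENSOR)   # profile that exposes raw signals
wlbt.Start()

wlbt.Trigger()
pair = wlbt.GetAntennaPairs()[0]            # one TX/RX antenna pair
amplitudes, time_axis = wlbt.GetSignal(pair)

plt.plot(time_axis, amplitudes, color='red')
plt.xlabel('response time')
plt.ylabel('amplitude')
plt.show()

wlbt.Stop()
wlbt.Disconnect()
```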
2D images

A 2D image only shows φ (the azimuth angle) versus R. However, any object in the real world is 3D and has height as well, so we introduce θ (the elevation angle) to represent height. To get a 3D image, let's look at Walabot's axis system.

where
X = R sinθ
Y = R cosθ sinφ
Z = R cosθ cosφ
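A minimal numpy sketch of this conversion, assuming θ and φ are given in degrees as in the Walabot arena settings:

```python
import numpy as np

def spherical_to_cartesian(r, theta_deg, phi_deg):
    """Convert Walabot (R, theta, phi) coordinates to (X, Y, Z).

    Follows the formulas above: theta is the elevation angle and
    phi is the azimuth angle, both in degrees.
    """
    theta = np.radians(theta_deg)
    phi = np.radians(phi_deg)
    x = r * np.sin(theta)
    y = r * np.cos(theta) * np.sin(phi)
    z = r * np.cos(theta) * np.cos(phi)
    return x, y, z

# Example: a point 100 cm away, 10 degrees up, 30 degrees to the side.
print(spherical_to_cartesian(100.0, 10.0, 30.0))
```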
3D images
Instead of using the 2D images directly, we construct 3D images by stacking them in the vertical direction.

The figure above shows how to stack the 2D images. The implementation process is (a code sketch follows the list):
- Concatenate the 2D images into a 3D matrix
- Use `measure.marching_cubes_classic` to compute vertices and faces
- Change the axis ranges from interval indices to real units, for example:
  (0, 100) => (1, 200) R (cm)
  (0, 61) => (-90, 90) φ (degrees)
  (0, 9) => (-20, 20) θ (degrees)
- Use `Poly3DCollection` to build the mesh
3D videos
Once we get a 3D image, we save it to an in-memory IO buffer, open the buffer with PIL, convert it to an ndarray, and write the ndarray as one frame of the video using OpenCV.
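A sketch of this frame-writing step, assuming each 3D image is a matplotlib figure; the codec, frame rate, and frame size below are placeholder choices:

```python
import io
import numpy as np
import cv2
from PIL import Image

frame_size = (640, 480)   # (width, height); must match the resized frames
writer = cv2.VideoWriter('walabot_3d.avi',
                         cv2.VideoWriter_fourcc(*'XVID'),
                         16.0,               # placeholder frame rate
                         frame_size)

def write_figure_as_frame(fig):
    """Save a matplotlib figure to an in-memory buffer and append it to the video."""
    buf = io.BytesIO()
    fig.savefig(buf, format='png')
    buf.seek(0)
    frame = np.array(Image.open(buf).convert('RGB'))   # PIL image -> ndarray (RGB)
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)     # OpenCV expects BGR
    frame = cv2.resize(frame, frame_size)
    writer.write(frame)

# Call write_figure_as_frame(fig) once per 3D image, then:
# writer.release()
```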
The video above shows a human walking around the Walabot radar.
CNN extracts features
For each frame, we use ResNet-18 to extract features. We change the last average pooling layer to max pooling, because max pooling extracts the most prominent features, such as edges, whereas average pooling smooths features out. For image data you can see the difference. Although both are used for the same purpose, I think max pooling is better at extracting extreme features; average pooling sometimes fails to extract good features because it takes everything into account and returns an average value, which may or may not matter for object-detection-type tasks. We then change the output linear layer to extract 10 features.
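A minimal PyTorch sketch of these two changes, assuming torchvision's ResNet-18 (in newer torchvision versions the pretrained argument is replaced by weights):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-18 as the per-frame feature extractor.
extractor = models.resnet18(pretrained=True)

# Replace the last average pooling with max pooling ...
extractor.avgpool = nn.AdaptiveMaxPool2d((1, 1))

# ... and change the output linear layer to produce 10 features per frame.
extractor.fc = nn.Linear(extractor.fc.in_features, 10)

# Example: one RGB frame of the 3D radar video -> a 10-dimensional feature vector.
frame = torch.randn(1, 3, 224, 224)
features = extractor(frame)
print(features.shape)        # torch.Size([1, 10])
```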
LSTM training
We use 3 frames to recognize one activity, since the 3D radar signal shown above is an abstract video. Unlike camera videos, where each frame represents an activity, a radar video can only be recognized from the change across continuous frames. We collect many 3-frame clips for each activity as training data. At test time, given a continuous radar video stream, we feed every 3 frames to the LSTM network, which then gives a prediction for each frame.
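A sketch of such a classifier, assuming the 10-dimensional per-frame features from the CNN above; the hidden size and number of activity classes are placeholder values:

```python
import torch
import torch.nn as nn

class ActivityLSTM(nn.Module):
    """Classify an activity from a short sequence of per-frame CNN features."""
    def __init__(self, feature_dim=10, hidden_dim=32, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):               # x: (batch, 3 frames, feature_dim)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1]) # logits: (batch, num_classes)

model = ActivityLSTM()
clips = torch.randn(8, 3, 10)           # 8 clips of 3 consecutive frames
logits = model(clips)
print(logits.shape)                     # torch.Size([8, 4])
```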
The detailed technique will be presented in my next paper.