Chinese Optics Letters, 2023, 21 (6): 061103, Published Online: May 29, 2023

Passive non-line-of-sight imaging for moving targets with an event camera

Author Affiliations
1 Key Laboratory of Photoelectronic Imaging Technology and System of Ministry of Education of China, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
2 Beijing National Research Center for Information Science and Technology (BNRist), Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Non-line-of-sight (NLOS) imaging is an emerging technique for detecting objects behind obstacles or around corners. Recent studies on passive NLOS imaging mainly focus on steady-state measurement and reconstruction methods, which show limitations in recognizing moving targets. To the best of our knowledge, we propose the first event-based passive NLOS imaging method. We acquire asynchronous event-based data of the diffusion spot on the relay surface, which contains detailed dynamic information of the NLOS target and efficiently eases the degradation caused by target movement. In addition, we demonstrate the event-based cues through the derivation of an event-NLOS forward model. Furthermore, we propose the first event-based NLOS imaging data set, EM-NLOS, and extract the movement feature by time-surface representation. We compare reconstructions from event-based data with those from frame-based data. The event-based method performs well on peak signal-to-noise ratio and learned perceptual image patch similarity, which are 20% and 10% better than the frame-based method, respectively.

1. Introduction

Non-line-of-sight (NLOS) imaging has attracted great attention with its widespread potential applications in object detection, autonomous driving, and anti-terrorist reconnaissance[1-3]. According to whether a controllable light source is used, NLOS imaging is classified into active NLOS imaging[4,5] and passive NLOS imaging[6,7].

Passive NLOS imaging shows promising application and research prospects due to its simple apparatus and convenient data acquisition. However, the NLOS problem is an inverse problem in mathematics that requires blind deconvolution, which is time-consuming and computationally burdensome. Consequently, light-cone transform theory using matrix inversion[8], back-projection algorithms based on photon time-of-flight information[9], and the wave-based phasor-field approach[10] have been proposed successively. However, few of these methods perform well in passive NLOS moving-target reconstruction, because the steady-state detection mode of passive NLOS suffers from serious degradation of the diffusion spot on the relay surface[11] and from the superposition effect of isotropic diffuse reflection by neighboring pixels[12]. Currently, speckle-coherence restoration[13] and intensity-based data-driven reconstruction methods[14,15] are used to tackle the ill-posed passive NLOS imaging problem. Since target movement induces motion blur in the intensity distribution on the relay surface, which superposes with the blur caused by diffusion[16], current end-to-end deep-learning approaches[11,17] perform well only on static NLOS targets[18] and show deficits in reconstruction quality for moving targets. In contrast, to realize high-quality and efficient reconstruction, we derive the event-form forward detection model of passive NLOS and establish the event-based inverse problem, on the basis of which we first put forward event cues for passive NLOS moving-target reconstruction. In this way, the dynamic information of the intensity diffusion is precisely captured by the event detection paradigm.

2. Principle and Methods

In this section, we present the working principle of the event camera and then explain the inverse problem setup by derivation of the forward model in passive NLOS imaging.

2.1. Event-based vision

The event camera[19], a novel neuromorphic vision device, responds only to per-pixel brightness changes asynchronously, while traditional frame-based cameras measure absolute brightness at a fixed rate. The recording paradigm of event-based vision provides high temporal resolution, high dynamic range, and low power consumption[20]. Therefore, it shows great potential in scenarios that challenge standard cameras, such as high-speed or high-dynamic-range imaging[21] and object detection[22] in a slightly changing optical field.

The working principle of the event camera is illustrated in Fig. 1. When the change in the logarithmic intensity of the brightness reaches the trigger threshold, the pixel is triggered and an event is recorded. Each pixel stores the logarithmic intensity at the time an event is fired and continuously monitors for a change of sufficient amplitude relative to this stored value. An event is described by four characteristic parameters: t is the time stamp of the fired event, recording the moment when the change in the logarithmic intensity exceeds the threshold; x and y give the spatial address of the pixel where the event is fired; and p is the polarity of the event, indicating whether the brightness increased or decreased.

Fig. 1. Schematic of events excitation with changes in brightness at pixel level. The red arrow stands for a fired positive event, while the green arrow stands for the negative one.


Data collected by an event camera are recorded in the form of four characteristic parameters: time stamp t, spatial address (x, y), and polarity flag p. The ith event is noted as

$$\mathrm{ev}_i = [t_i,\ x_i,\ y_i,\ p_i]^{\mathrm{T}}, \quad i \in \mathbb{N}. \tag{1}$$
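As a rough illustration, the per-pixel triggering rule described above can be sketched in Python. The contrast threshold, sample format, and function name are illustrative assumptions, not the camera's actual pipeline.

```python
import math
from typing import List, Tuple

def generate_events(samples: List[Tuple[float, int, int, float]],
                    threshold: float = 0.2) -> List[Tuple[float, int, int, int]]:
    """samples: (t, x, y, intensity) readings; returns events (t, x, y, p).

    An event fires when the log-intensity change since the last event at that
    pixel exceeds the contrast threshold (illustrative value).
    """
    last_log = {}          # per-pixel stored log-intensity at the last event
    events = []
    for t, x, y, intensity in samples:
        log_i = math.log(intensity)
        ref = last_log.setdefault((x, y), log_i)
        delta = log_i - ref
        if abs(delta) >= threshold:
            p = 1 if delta > 0 else -1     # polarity: brightness up or down
            events.append((t, x, y, p))
            last_log[(x, y)] = log_i       # reset the stored reference value
    return events

# One pixel brightening then dimming yields one positive and one negative event.
events = generate_events([(0.0, 0, 0, 1.0), (0.1, 0, 0, 1.5), (0.2, 0, 0, 1.0)])
```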

2.2. Forward model of passive NLOS

The forward model of NLOS imaging is a mathematical description of the light transport during data collection, which can be regarded as the inverse process of NLOS imaging. It forms the theoretical basis of NLOS reconstruction from the perspective of steady-state imaging and derivation.

The intensity of the diffusion spot on the relay surface[23] can be expressed as

$$I(p_y) = \int_{p_f \in F} A(p_f, p_y)\, I(p_f)\, \mathrm{d}p_f + N, \tag{2}$$

where $I(p_y)$ represents the intensity distribution in the field of view (FoV) on the relay surface, $I(p_f)$ represents the intensity of the self-luminous target, $N$ represents the sum of the ambient light noise and the random noise in detection, and $A(p_f, p_y)$ stands for the point-to-point transmission process[23], which can be specialized as

$$A(p_f, p_y) = \frac{\cos[\angle(p_y - p_f,\ n_{p_f})] \cdot \cos[\angle(p_f - p_y,\ n_{p_y})]}{\lVert p_y - p_f \rVert_2^2} \cdot \mu(p_f, p_y), \tag{3}$$

where $\angle(p_y - p_f,\ n_{p_f})$ and $\angle(p_f - p_y,\ n_{p_y})$ denote the angles between the vector connecting the monitored pixel and the light source and the normal vector of each respective plane. The cosine functions of these two angles characterize the effect of the geometric relations on the optical transmission, while the squared Euclidean norm quantifies the attenuation with range. We assume that the relay surface can be approximately modeled as an isotropic diffuse reflector, so the coefficient $\mu$, which describes the effect of the bidirectional reflectance distribution function (BRDF), is a constant. Substituting Eq. (3) into Eq. (2) gives the pixel-wise detection function on the relay surface[11],

$$I(p_y) = \int_{p_f \in F} \frac{\cos[\angle(p_y - p_f,\ n_{p_f})] \cdot \cos[\angle(p_f - p_y,\ n_{p_y})]}{\lVert p_y - p_f \rVert_2^2} \cdot \mu \cdot I(p_f)\, \mathrm{d}p_f + N, \tag{4}$$

which can be simplified as

$$y = A f + N. \tag{5}$$

The detection function of the diffusion on the relay surface can be written in matrix form,

$$\begin{bmatrix} I_{y,1} \\ \vdots \\ I_{y,h\times w} \end{bmatrix} = \begin{bmatrix} A_{1,1} & \cdots & A_{1,H\times W} \\ \vdots & \ddots & \vdots \\ A_{h\times w,1} & \cdots & A_{h\times w,H\times W} \end{bmatrix} \begin{bmatrix} I_{f,1} \\ \vdots \\ I_{f,H\times W} \end{bmatrix} + N, \tag{6}$$

where the intensity of pixel $i$ in the FoV is expressed by the element $I_{y,i}$, while $I_{f,i}$ represents the intensity of pixel $i$ of the self-luminous target. The transmission matrix $A$ is discretized into elements $A_{i,j}$, and $h \times w$ and $H \times W$ are the pixel ranges of the FoV and the target, respectively.
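As an illustration of the discretized forward model above, the following sketch builds a small transmission matrix from an assumed geometry. The parallel-plane layout, grid sizes, separation distance, and mu are arbitrary demonstration values, not the paper's experimental parameters.

```python
import numpy as np

def transmission_matrix(target_pts, relay_pts, n_t, n_r, mu=1.0):
    """A[i, j]: contribution of target pixel j to relay (FoV) pixel i,
    combining the two angle cosines, inverse-square falloff, and constant mu."""
    d = relay_pts[None, :, :] - target_pts[:, None, :]   # p_y - p_f, shape (J, I, 3)
    dist = np.linalg.norm(d, axis=-1)
    cos1 = np.abs(d @ n_t) / dist                        # angle at the target pixel
    cos2 = np.abs((-d) @ n_r) / dist                     # angle at the relay pixel
    return (mu * cos1 * cos2 / dist**2).T                # shape (I, J)

# Tiny example: 2x2 target patch facing a 2x2 relay patch, planes 0.25 m apart.
tx, ty = np.meshgrid(np.linspace(-0.01, 0.01, 2), np.linspace(-0.01, 0.01, 2))
target = np.stack([tx.ravel(), ty.ravel(), np.zeros(4)], axis=1)
relay = target + np.array([0.0, 0.0, 0.25])
A = transmission_matrix(target, relay, np.array([0, 0, 1.0]), np.array([0, 0, -1.0]))
f = np.array([1.0, 0.0, 0.0, 0.0])                       # one lit target pixel
y = A @ f                                                # noiseless diffusion spot
```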

2.3. Event-based reconstruction method

In our work, we adopt event cues to extract the movement information and texture features of the targets. The most intuitive representation of the dynamic information on the relay surface is the original event data. However, as addressed in Eq. (1), an event is composed of a triggered time stamp, a spatial address, and a polarity flag, which belong to three different data formats. As a result, it is critical to convert the sparse 4D event points into a feature tensor that contains both temporal and spatial characteristics. Therefore, we adopt the time-surface[24] method to represent the event-based data and extract the featured diffusion spot, which contains rich information on the target movement.

We visualize the event-based data to demonstrate their representations. The relay surface is selected as a mirror for clear visualization. The target “A” is moving from left to right in the FoV, as shown in Fig. 2.

Fig. 2. Representations of event-based data.


In Fig. 2(b), the frame-based image is captured by a traditional camera, while Fig. 2(c) shows the asynchronous events captured by an event camera as 3D scatter points with coordinates x, y, and t, in which red stands for positive polarity and blue for negative polarity. We select a short time interval as the temporal length of the voxel grid and plot the 3D time-surface map with the time-surface calculation[24], which contains both the temporal and spatial correlations of the selected events, as shown in Fig. 2(e). If we normalize the time-surface value at each spatial address, project it onto the xoy plane, and display it as gray-scale intensity, we can express the spatiotemporal correlation information in the form of an event address map[25,26], as shown in Fig. 2(d).

Based on the detection function in Eq. (6), we represent the intensity of pixel $i$ in the FoV at the current moment by the element $I'_{y,i}$, while $I_{y,i}$ represents the intensity of pixel $i$ at the center time stamp of the adjacent voxel grid. According to the working principle of the event camera, we establish the event-based detection function by taking the difference between $I'_{y,i}$ and $I_{y,i}$:

$$\begin{bmatrix} I'_{y,1} - I_{y,1} \\ I'_{y,2} - I_{y,2} \\ \vdots \\ I'_{y,h\times w} - I_{y,h\times w} \end{bmatrix} = A \begin{bmatrix} I'_{f,1} - I_{f,1} \\ I'_{f,2} - I_{f,2} \\ \vdots \\ I'_{f,H\times W} - I_{f,H\times W} \end{bmatrix} + N' - N, \tag{7}$$

and the intensity of the brightness change at pixel $i$ is expressed as

$$I'_{y,i} - I_{y,i} = [A_{i,1}\ A_{i,2}\ \cdots\ A_{i,H\times W}] \lim_{\Delta t \to t_0} \Delta f + N_r = [A_{i,1}\ A_{i,2}\ \cdots\ A_{i,H\times W}] \begin{bmatrix} I^m_{f,1} \\ I^m_{f,2} \\ \vdots \\ I^m_{f,H\times W} \end{bmatrix} + N_r, \tag{8}$$

where $t_0$ is the time interval between two adjacent voxel grids, and the limit expresses the accumulation of the target intensity changes at pixel $i$ during $t_0$, written as $I^m_{f,i}$, which contains the movement information of the target. The noise caused by ambient light is canceled by the differential sampling principle of event-based detection, and only random noise remains, i.e., $N_r = N' - N$.

Then, we utilize the step function $\mathrm{Fire}(I'_y - I_y)$ to discriminate whether an event is fired. As shown in Fig. 3, when the intensity difference exceeds the threshold and falls in the red or blue area, an event is recorded at that pixel and $[t_i, x_i, y_i, p_i]$ is output.

Fig. 3. Discrimination function of fired event.

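A minimal sketch of this discrimination step, with an assumed threshold and an assumed pair of intensity frames:

```python
import numpy as np

def fire(I_prev: np.ndarray, I_curr: np.ndarray, t: float, threshold: float = 5.0):
    """Return the fired events [t, x, y, p] for one pair of adjacent frames:
    pixels whose intensity difference exceeds the threshold fire an event with
    the corresponding polarity (threshold is an illustrative value)."""
    diff = I_curr.astype(float) - I_prev.astype(float)
    ys, xs = np.nonzero(np.abs(diff) >= threshold)
    return [[t, int(x), int(y), 1 if diff[y, x] > 0 else -1]
            for y, x in zip(ys, xs)]

prev = np.zeros((2, 2))
curr = np.array([[10.0, 0.0], [0.0, -8.0]])
events = fire(prev, curr, t=0.01)   # one positive and one negative event
```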

After extracting the dynamic information on the diffusion spot movements, we put forward an event-embedded framework that fuses the extracted event features with a UNet structure to solve the inverse problem. As shown in Fig. 4, we display a video containing a parallel-moving target on a smartphone and leverage event-based vision to record the dynamic diffusion spot on the relay surface.

Fig. 4. Flow chart of event-embedded passive NLOS imaging.


We perform a time-surface calculation on the voxel grid[21] to extract the featured diffusion spot and represent event data in different time intervals as a series of 2D intensity images. The time-surface[24] is expressed by

$$S_i(\rho, p) = e^{-[t_i - T_i(\rho, p)]/\tau}, \tag{9}$$

where $S_i$ is the time-surface value of $\mathrm{ev}_i$, defined by applying an exponential decay kernel with time constant $\tau$ to the context time stamps $T_i(\rho, p)$, where $p$ is the polarity flag. $T_i$ represents the time-context information around an incoming event $\mathrm{ev}_i$ as the array of the most recent event times at $t_i$ for the neighboring pixels within a radius $\rho$,

$$T_i(\rho, p) = \max_{j \le i} \{\, t_j \mid r_j = r_i + \rho,\ p_j = p \,\}. \tag{10}$$
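The time-surface representation can be illustrated by the following simplified per-pixel sketch, which ignores the neighborhood radius ρ and the polarity split for brevity; the decay constant and the event stream are illustrative assumptions.

```python
import math

def time_surface(events, width, height, t_ref, tau=0.05):
    """Per-pixel surface value at reference time t_ref: S = e^{-(t_ref - T)/tau},
    where T is the most recent event time stamp at that pixel (simplified
    variant of the full time-surface, with no spatial neighborhood)."""
    last_t = [[None] * width for _ in range(height)]
    for t, x, y, p in events:
        last_t[y][x] = t                    # keep the latest time stamp per pixel
    return [[0.0 if last_t[y][x] is None
             else math.exp(-(t_ref - last_t[y][x]) / tau)
             for x in range(width)]
            for y in range(height)]

# Two events on a 2x1 grid: the older event decays, the newer one stays at 1.
surface = time_surface([(0.0, 0, 0, 1), (0.05, 1, 0, 1)],
                       width=2, height=1, t_ref=0.05)
```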

According to the assumption that the fired events of adjacent voxel grids share similar spatial addresses, the spatial addresses of the fired event stream are highly correlated. As a result, when we perform a time-surface calculation on the event data acquired in a voxel grid, the context information of the movements is accumulated. Then, the temporal and spatial features of the moving target can be expressed by

$$E_{\mathrm{map}} = S\!\left[ y_{\mathrm{FoV}},\ \lim_{\Delta t \to t_0} \mathrm{Fire}(I'_y - I_y) \right] = B \cdot f, \tag{11}$$

where $\mathrm{Fire}(\cdot)$ is defined in Fig. 3, $S$ stands for the time-surface representation, and $B$ is the counterpart of $A$.

The event-based inverse problem is established by

$$f = B^{-1} \cdot E_{\mathrm{map}}. \tag{12}$$

A typical solution of this reconstruction is to use optimization methods to solve for the matrix inverse of $B$:

$$\hat{f} = \underset{f}{\arg\min}\ \lVert Bf - E_{\mathrm{map}} \rVert_2^2 + J, \tag{13}$$

where $J$ denotes the prior, which is obtained from the asynchronous sampling paradigm of the event camera.
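As an illustration of this regularized least-squares formulation, the sketch below substitutes a simple Tikhonov (ridge) term for the prior J; the paper itself solves this step with a learned R-UNet rather than an explicit inverse. Matrix sizes and the regularization weight are arbitrary.

```python
import numpy as np

def solve_inverse(B: np.ndarray, e_map: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """argmin_f ||B f - e_map||^2 + lam ||f||^2, solved in closed form via the
    normal equations (a stand-in for the prior-regularized problem)."""
    n = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n), B.T @ e_map)

# Synthetic check: with a well-conditioned B and no noise, the target is
# recovered almost exactly for a tiny regularization weight.
rng = np.random.default_rng(0)
B = rng.normal(size=(16, 8))
f_true = rng.normal(size=8)
f_hat = solve_inverse(B, B @ f_true, lam=1e-8)
```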

However, the condition number of matrix B is relatively large due to pixel-wise mutual interference, making B effectively rank-deficient. Moreover, iteratively computing B⁻¹ is time-consuming. Therefore, for the final step of solving the inverse problem in NLOS reconstruction, we employ a UNet structure, use skip connections to perform multiscale feature fusion, and add a residual block to avoid gradient problems during training.

3. Experimental Setups

For experimental proof of the event-based approach, we constructed the setup shown in Fig. 5. We displayed a video that provided the self-luminous moving target in the NLOS region of the event camera, blocked by the obstacle. The moving diffusion spot on the relay surface is recorded by the event camera (CeleX-V); see Section 1 of the Supplementary Material and Visualization 1.

Fig. 5. Experimental setup. (a) Basic principle of our NLOS scene; (b) experimental settings; the self-luminous target is a video with moving digits.


The targets used in the experiment are numeric digits selected from the MNIST training set, MNIST test set, and PRINT test set (Arial-font numbers), with a size of 3 cm × 3 cm, placed 25 cm away from the frosted aluminum fender.

We select 14 different instances of each digit (0–9) in the MNIST training set and test set, and then acquire both event-based and frame-based data with the different modes of the CeleX-V camera. The self-luminous target displayed on the smartphone translates from left to right in the FoV at a preset speed of 2.5 cm/s. When recording the moving diffusion spot in full-picture (F) mode, we obtain a series of snapshots at different positions at a frame rate of 100 frames per second, while in event-intensity (EI) mode, we obtain a stream of event-based data of the diffusion spot movement. Data collected by these two modes are calibrated to the ground truth by time stamp and made into the image-format data sets event MNIST NLOS (EM-NLOS) and frame MNIST NLOS (FM-NLOS), respectively.

To the best of our knowledge, we are the first to establish an event-based NLOS data set, EM-NLOS, which contains 4080 images in the training and validation sets and 210 images in the test set. The training set is made up of 3950 featured event time-surface maps covering 130 targets (13 groups of digits 0–9) at different positions, while the validation set contains 130 images. The test set consists of 110 images of 10 digits (0–9) selected from the MNIST test set and 100 images of 10 digits (0–9) in Arial font. As its counterpart, FM-NLOS contains the corresponding frame-based intensity diffusion spot movements, with 4180 images in total. We compare the training results on EM-NLOS and FM-NLOS, denoted as the event-based method (E method) and the frame-based method (F method), respectively. The reconstruction accuracies of the F method and the E method on the MNIST test set and the PRINT test set are shown and compared in Fig. 6.

Fig. 6. (a) Part of the reconstruction results for the PRINT test set in both EM-NLOS and FM-NLOS; (b) part of the reconstruction results for the MNIST test set in both EM-NLOS and FM-NLOS.


We trained our residual-UNet (R-UNet) on the EM-NLOS training set with an adaptive moment estimation (Adam) optimizer on an Nvidia RTX 3090 GPU for 800 epochs. For a fair comparison, we trained the R-UNet on the FM-NLOS training set with the same configuration[17]. The structure of our R-UNet and the training parameters are given in Section 2 of the Supplementary Material.

4. Experimental Results and Discussions

The reconstruction quality of the moving target is assessed from two perspectives: the visual reconstruction quality and the position accuracy. For the former, we introduce the peak signal-to-noise ratio (PSNR) and learned perceptual image patch similarity (LPIPS)[27] to evaluate the reconstructions. The reconstruction accuracies of the F method and the E method on the MNIST test set and the PRINT test set are shown and compared in Figs. 6(a) and 6(b), respectively. It is obvious that the proposed E method with event-based data shows much better reconstruction quality than the F method, especially in recognizing the digits.
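For reference, PSNR can be sketched as follows for 8-bit images (MAX = 255); this is a generic illustration, and LPIPS, which requires a pretrained deep network, is not reproduced here.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences:
    10 * log10(MAX^2 / MSE); identical images give infinite PSNR."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# Four pixels with one error of 16 gray levels: MSE = 256 / 4 = 64.
score = psnr([0, 120, 255, 30], [16, 120, 255, 30])
```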

As for the position accuracy, we define the contour distance (Cd) index to measure the position of the reconstructions. The Cd value is evaluated as the average distance between the left edge of the image and the digit's left contour (formed by the first pixel with a gray scale of 255 in each row after image binarization). Digit 7 in the MNIST test set and digit 3 in the PRINT test set are demonstrated as examples in this Letter. As shown in Fig. 7, the reconstruction by the E method is closer to the ground truth than that of the F method. The average Cd deviation of the E method is far smaller than that of the F method, as shown in Fig. 8.
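A minimal sketch of the Cd index as described, with an assumed binarization threshold:

```python
import numpy as np

def contour_distance(image: np.ndarray, thresh: int = 127) -> float:
    """Average column index of the digit's left contour: after binarization,
    take the first pixel with value 255 in each row (rows with no white pixel
    are skipped). Threshold is an illustrative value."""
    binary = np.where(image > thresh, 255, 0)
    cols = [row.argmax() for row in binary == 255 if row.any()]
    return float(np.mean(cols)) if cols else float("nan")

# 3x5 toy image with left-contour pixels at columns 2, 3, and 2.
img = np.zeros((3, 5), dtype=np.uint8)
img[0, 2] = img[1, 3] = img[2, 2] = 255
cd = contour_distance(img)   # (2 + 3 + 2) / 3
```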

Fig. 7. Reconstructions of NLOS moving target at different positions through the E method and the F method. Six different positions of digit 7 (MNIST test set) and digit 3 (PRINT test set) are displayed as examples. The Cd value (pixel) is labeled at the corner of each frame.


Fig. 8. Cd value of NLOS reconstructions at different positions. (a), (b) are the Cd values of reconstructions shown in Fig. 7 (digit 7 and digit 3, respectively).


Furthermore, the visual reconstruction accuracy of the E method is also clearly higher than that of the F method. One can see from Fig. 9 that the E method performs better on both metrics, indicating that the event-based approach surpasses the frame-based method under the same data set size and network structure.

Fig. 9. Evaluation metrics LPIPS and PSNR for reconstructions of digit 7 (MNIST test) and digit 3 (PRINT test) at 10 different positions, respectively. The solid line denotes the E method, while the dotted line denotes the F method.


We statistically analyze the reconstruction accuracy indexes of 10 digits in both the MNIST test set and the PRINT test set at different positions. The average LPIPS and PSNR of each reconstructed frame for the different test digits with the E method and the F method are shown in Table 1. One can see from Table 1 that the LPIPS of reconstructions obtained by the E method is 38% and 11% lower than that of the F method on the two test sets, respectively, which demonstrates higher perceptual quality. As for the PSNR, the E method scores higher than the F method on every test target, performing about 10% better numerically.

Table 1. Evaluation Metrics of Results with the E Method and the F Method

Columns: Digit; PSNR/dB (↑) on the MNIST test set and the PRINT test set; LPIPS (↓) on the MNIST test set and the PRINT test set; each metric reported for E (ours) and F.


As for the generalization, see Section 3 of the Supplementary Material for discussions about target movement and Visualization 2 for the possibility of applying event-based NLOS imaging in real-world circumstances.

5. Conclusion

In summary, we leverage the sampling characteristics of an event camera and propose a new detection and reconstruction method for passive NLOS imaging. The E method extracts rich dynamic information from the diffusion spot movements and provides a physical foundation for passive NLOS imaging of moving targets. Compared with the deep-learning approach using a traditional camera, the event-based framework shows better performance when reconstructing NLOS moving targets. We carried out experiments on two types of targets with different distribution forms and verified that the proposed framework significantly improves reconstruction quality in both visual accuracy and position accuracy. The reconstruction quality on the PRINT test set indicates that our method extracts more movement information from moving targets with the event detection paradigm than traditional frame-based detection does. We believe that the event-based approach to the inverse problem, together with the EM-NLOS data set, is an important step that can inspire new ideas in feature-embedded passive NLOS imaging with multidetector information fusion[12] and NLOS target tracking[28]. The event cues we demonstrate fuse the event-paradigm information of NLOS moving targets with end-to-end data-driven methods for solving the event-based inverse problem. In future work, we will take target movements and environmental disturbances into consideration, and continue to advance the practical application of event-based cues by enhancing methods based on other dimensions of the light field. The event-based vision utilized in this work has great potential to facilitate further research on feature-embedded passive NLOS imaging and its applications.


[1] D. Faccio, A. Velten, G. Wetzstein. Non-line-of-sight imaging. Nat. Rev. Phys., 2020, 2: 318.

[2] A. Kirmani, T. Hutchison, J. Davis, R. Raskar, "Looking around the corner using transient imaging," in IEEE 12th International Conference on Computer Vision (ICCV) (2009), p. 159.

[3] T. Maeda, G. Satat, T. Swedish, L. Sinha, R. Raskar, "Recent advances in imaging around corners," arXiv:1910.05613 (2019).

[4] M. La Manna, F. Kine, E. Breitbach, J. Jackson, T. Sultan, A. Velten. Error backprojection algorithms for non-line-of-sight imaging. IEEE Trans. Pattern Anal. Mach. Intell., 2019, 41: 1615.

[5] M. O’Toole, D. B. Lindell, G. Wetzstein. Confocal non-line-of-sight imaging based on the light-cone transform. Nature, 2018, 555: 338.

[6] T. Sasaki, C. Hashemi, J. R. Leger. Passive 3D location estimation of non-line-of-sight objects from a scattered thermal infrared light field. Opt. Express, 2021, 29: 43642.

[7] J. Boger-Lombard, O. Katz. Passive optical time-of-flight for non-line-of-sight localization. Nat. Commun., 2019, 10: 3343.

[8] C. Pei, A. Zhang, Y. Deng, F. Xu, J. Wu, D. U.-L. Li, H. Qiao, L. Fang, Q. Dai. Dynamic non-line-of-sight imaging system based on the optimization of point spread functions. Opt. Express, 2021, 29: 32349.

[9] A. Velten, T. Willwacher, O. Gupta, A. Veeraraghavan, M. G. Bawendi, R. Raskar. Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging. Nat. Commun., 2012, 3: 745.

[10] X. Liu, I. Guillén, M. La Manna, J. H. Nam, S. A. Reza, T. H. Le, A. Jarabo, D. Gutierrez, A. Velten. Non-line-of-sight imaging using phasor-field virtual wave optics. Nature, 2019, 572: 620.

[11] R. Geng, Y. Hu, Z. Lu, C. Yu, H. Li, H. Zhang, Y. Chen. Passive non-line-of-sight imaging using optimal transport. IEEE Trans. Image Process., 2021, 31: 110.

[12] K. Tanaka, Y. Mukaigawa, A. Kadambi, "Polarized non-line-of-sight imaging," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), p. 2136.

[13] O. Katz, P. Heidmann, M. Fink, S. Gigan. Non-invasive singleshot imaging through scattering layers and around corners via speckle correlations. Nat. Photonics, 2014, 8: 784.

[14] M. Tancik, T. Swedish, G. Satat, R. Raskar, "Data-driven non-line-of-sight imaging with a traditional camera," in Imaging and Applied Optics (2018), paper IW2B.6.

[15] J. He, S. Wu, R. Wei, Y. Zhang. Non-line-of-sight imaging and tracking of moving objects based on deep learning. Opt. Express, 2022, 30: 16758.

[16] C. A. Metzler, F. Heide, P. Rangarajan, M. M. Balaji, A. Viswanath, A. Veeraraghavan, R. G. Baraniuk. Deep-inverse correlography: towards real-time high-resolution non-line-of-sight imaging. Optica, 2020, 7: 63.

[17] C. Zhou, C.-Y. Wang, Z. Liu, "Non-line-of-sight imaging off a Phong surface through deep learning," arXiv:2005.00007 (2020).

[18] A. Zhang, J. Wu, J. Suo, L. Fang, H. Qiao, D. D.-U. Li, S. Zhang, J. Fan, D. Qi, Q. Dai, C. Pei. Single-shot compressed ultrafast photography based on U-net network. Opt. Express, 2020, 28: 39299.

[19] G. Gallego, T. Delbruck, G. M. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, D. Scaramuzza. Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell., 2020, 44: 154.

[20] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, J. Kusnitz, M. Debole, S. Esser, T. Delbruck, M. Flickner, D. Modha, "A low power, fully event-based gesture recognition system," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), p. 7388.

[21] H. Rebecq, R. Ranftl, V. Koltun, D. Scaramuzza. High speed and high dynamic range video with an event camera. IEEE Trans. Pattern Anal. Mach. Intell., 2021, 43: 1964.

[22] S. Schaefer, D. Gehrig, D. Scaramuzza, "AEGNN: asynchronous event-based graph neural networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), p. 12361.

[23] C. Saunders, J. Murray-Bruce, V. K. Goyal. Computational periscopy with an ordinary digital camera. Nature, 2019, 565: 472.

[24] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, R. B. Benosman. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2017, 39: 1346.

[25] C. Yan, X. Wang, X. Zhang, X. Li. Adaptive event address map denoising for event cameras. IEEE Sens. J., 2022, 22: 3417.

[26] C. Wang, X. Wang, C. Yan, K. Ma. Feature representation and compression methods for event-based data. IEEE Sens. J., 2023, 23: 5109.

[27] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018), p. 586.

[28] S. Chan, R. E. Warburton, G. Gariepy, J. Leach, D. Faccio. Non-line-of-sight tracking of people at long range. Opt. Express, 2017, 25: 10109.

Conghe Wang, Yutong He, Xia Wang, Honghao Huang, Changda Yan, Xin Zhang, Hongwei Chen. Passive non-line-of-sight imaging for moving targets with an event camera[J]. Chinese Optics Letters, 2023, 21(6): 061103.



