Humans and many other animals can perceive depth thanks to binocular vision. The goal of computer stereo vision is to give a machine the same capability: to reconstruct a three-dimensional representation of the environment. Accordingly, this study first gives the reader an overview of how people perceive three-dimensional space. It then examines how computer stereo vision, in a process akin to human binocular vision, extracts depth from images. Lastly, it looks at the creation of 3D representations from 2D photographs, a significant application of depth information extraction.
Binocular Vision: Transition from 2D to 3D
The primary benefit of binocular vision for living things is the capacity to sense depth from the two-dimensional images the eyes receive. The human brain matches features between the two images to reconstruct the three-dimensional shape of an object. This process, called stereopsis, occurs when the image inputs from both eyes project onto the retinas.
Stereopsis requires binocular cues, that is, input from both eyes, which sets it apart from other eye-based methods of extracting depth information. Animals can sense the distance to nearby objects in a number of ways; their eyes are not the only means they use [1]. Some animals have eyes that face in nearly opposite directions. This positioning reduces their depth perception but increases their field of vision [2].
When only one eye receives information, the brain can rebuild a 3D image of the world only if it has previously seen such objects in three dimensions; it then automatically gives the 2D image the missing depth (see Figure 1). Without such experience, determining the distance between the viewer (or camera) and an object must rely on monocular cues. The primary one is motion parallax: objects closer to the viewer appear to move faster across the field of view, while more distant objects move more slowly and appear smaller.
Decoding Stereoscopic Images
Figure 2 illustrates a variety of 2D patterns. The human brain may perceive these patterns as leaves or tree branches, giving the image different levels of depth; experience adds that depth automatically. The brain can, however, interpret the depth information differently: if one focuses the eyes just beyond the image, a giraffe appears. Closing or obscuring one eye makes the giraffe's 3D representation disappear. This demonstrates how stereopsis, which requires simultaneous input to both eyes, is specifically linked to binocular vision.
Going from 3D leaves to a 3D giraffe in Figure 2 takes time and trial and error, yet the shift from 2D to 3D perception typically begins early in infancy. Sue Barry, a professor of neuroscience, argues that the brain can learn to see in three dimensions only with accurate input from both eyes [3]. People whose eyesight developed atypically may suffer from stereo blindness: they fail to perceive depth from the two pictures projected onto their retinas.
This project’s subsequent sections will describe how to use the OpenCV library to accomplish computer stereo vision.
Data Collection in Stereo Vision
The data used in this study consists mostly of stereoscopic photographs, along with pairs of pre-rectified images of the same scene.
Real-time camera captures are largely avoided because camera calibration and hardware setup can be difficult and time-consuming [4]. A huge collection of suitable photographs is already available online, ready for examination, so there is no need to create them all by hand. However, we do use pictures from a USB-connected camera and Microsoft's built-in Windows 10 Camera app to test some of the ideas.
Data Processing in Stereo Vision
The StereoBM class in OpenCV offers a method for obtaining a depth map from two pictures. However, the StereoBM API requires undistorted, rectified 8-bit single-channel images [5]. Since not all photos come in that format, it is helpful that OpenCV provides function calls for all the essential pre-processing.
RGB to Grayscale conversion
The first step converts the 3-channel RGB image representation into a 1-channel grayscale one. OpenCV offers an image transformation function, cvtColor(), that automates this process; the Luma coding formula it applies is provided in the documentation [6]. The function outputs a matrix of single-channel 8-bit values, which we then use for block matching.
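As a minimal sketch of this step, assuming OpenCV 3.x Java bindings and placeholder file names:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class GrayscaleStep {
    public static void main(String[] args) {
        // Load the native OpenCV library before any other OpenCV call
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // "left.png" is a placeholder; imread returns a 3-channel matrix
        // in BGR channel order
        Mat color = Imgcodecs.imread("left.png", Imgcodecs.IMREAD_COLOR);

        // cvtColor applies the Luma formula Y = 0.299 R + 0.587 G + 0.114 B [6]
        // and yields the single-channel 8-bit matrix that StereoBM expects
        Mat gray = new Mat();
        Imgproc.cvtColor(color, gray, Imgproc.COLOR_BGR2GRAY);

        Imgcodecs.imwrite("left_gray.png", gray);
    }
}
```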
Block Matching
OpenCV calculates the disparity map using block-matching stereo correspondence. The algorithm takes a block of pixels in the left image and attempts to locate the same block in the right image. Far fewer comparisons are needed to find the matching locations thanks to the epipolar constraint [7]. To exploit this constraint, one must either know the orientations and positions of the two cameras or have already rectified the images so that they lie in a single plane. In the latter scenario, the epiline in the right image is always horizontal, which helps separate image depth estimation from the camera intrinsics.
The method uses block similarity measurements to determine which candidate in a group has the highest similarity score for a given block.
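The following sketch shows how the block-matching call might look with the OpenCV 3.x Java bindings, where SADWindowSize is exposed as blockSize; the 64 disparity levels and the 15-pixel block size are illustrative choices, not values prescribed by this project:

```java
import org.opencv.calib3d.StereoBM;
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;

public class BlockMatchingStep {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // A rectified, grayscale stereo pair (file names are placeholders)
        Mat left = Imgcodecs.imread("left_gray.png", Imgcodecs.IMREAD_GRAYSCALE);
        Mat right = Imgcodecs.imread("right_gray.png", Imgcodecs.IMREAD_GRAYSCALE);

        // 64 disparity levels, 15x15 comparison blocks
        StereoBM stereo = StereoBM.create(64, 15);

        // Each output pixel holds the disparity of the best-scoring block
        // match, stored as a fixed-point 16-bit value (disparity * 16)
        Mat disparity = new Mat();
        stereo.compute(left, right, disparity);
    }
}
```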
Disparity Calculation
Once the matching block has been located, the disparity is easy to compute: it is the horizontal shift required to move the block from its position in the left image to its position in the right image. As the following formula states, the disparity value is inversely proportional to the depth value at that pixel:
disparity = x − x′ = B·f / Z
Here Z is the depth value, B is the baseline (the distance between the cameras), and f is the focal length; please consult the diagram in [8] for a comprehensive explanation and visualization. As an aside, parallax scrolling illustrates the same inverse relationship: the farther an object is from the observer, the less it shifts.
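To make the relationship concrete, here is a worked example with hypothetical values: a baseline B of 0.1 m and a focal length f of 700 px. A pixel with a disparity of 35 px then lies at a depth of

Z = B·f / disparity = (0.1 m × 700 px) / 35 px = 2 m

and halving the disparity to 17.5 px would double the depth to 4 m.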
Disparity Map
After the disparities are computed, a disparity map for the left picture is created: each pixel of the left image receives a disparity value. The map can be viewed as a grayscale image in which lighter pixels appear nearer to the viewer than darker ones (Figure 3).
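A small sketch of how such a grayscale rendering might be produced with the OpenCV Java API; since the StereoBM output is 16-bit fixed-point, it is stretched to the 0..255 range first:

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;

public class DisparityMapStep {
    // `disparity` is the 16-bit output of StereoBM.compute from the
    // previous step
    static void saveAsGrayscale(Mat disparity) {
        // Stretch the disparity range to 0..255 so that larger disparities
        // (nearer objects) come out as lighter pixels
        Mat display = new Mat();
        Core.normalize(disparity, display, 0, 255, Core.NORM_MINMAX, CvType.CV_8U);
        Imgcodecs.imwrite("disparity.png", display);
    }
}
```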
From Stereo Vision to 3D Reconstruction
With only the two photos and no knowledge of the camera setup, we cannot determine the exact values of B and f. It is crucial to remember, however, that only Z and the disparity vary between pixels; B and f remain fixed. As a result, the disparity itself can be thought of as an inverse depth scaled by a constant factor. This yields a depth map that gives the accurate depth values of pixels relative to one another.
The depth map is linked to one of the pictures in the stereo pair, typically the left one; the other picture is used only to determine the depth of the original image or of its projection onto a shared plane. Consequently, a point in the 3D scene can be fully described by its colour, its two coordinates in the original image, and its depth value. The collection of these points, sometimes referred to as a 3D point cloud, can be used to recreate a 3D representation of the scene (Figure 4).
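When full calibration data is available (which this project deliberately avoids), OpenCV can perform this reprojection directly. The sketch below assumes a 4×4 reprojection matrix Q of the kind produced by stereoRectify during camera calibration:

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;

public class PointCloudStep {
    // `disparity` comes from StereoBM.compute; `q` is the 4x4 reprojection
    // matrix that stereoRectify produces (assumed to be available here)
    static Mat toPointCloud(Mat disparity, Mat q) {
        // Each output element holds the (X, Y, Z) position of the matching
        // pixel; together with the pixel's colour this forms the point cloud
        Mat points3d = new Mat();
        Calib3d.reprojectImageTo3D(disparity, points3d, q);
        return points3d;
    }
}
```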
Consequently, even with only two photos, the computer can already determine the relative depth of pixels. This method works well for navigation: NASA rovers, for example, navigate and manipulate objects on Mars using a similar concept, elevation mapping based on stereo vision [9].
Experimental Results and Comparisons
A Java application, which is part of this project, lets you observe the impact of different OpenCV parameters. It allows depth analysis of one's own or third-party photos, and it can also extract the patterns hidden in stereoscopic images.
SADWindowSize parameter
Numerous parameters adjust the performance and accuracy of the StereoBM class, including the number of disparities, SADWindowSize, minimum disparity, speckle range, speckle window size, and many more.
SADWindowSize determines the size of each pixel block that the block-matching algorithm compares between the images of a stereo pair. With larger values the method discovers matching blocks more reliably and eliminates noise, but the large blocks lower the final resolution and add blurriness (see Figure 5).
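One way to observe this trade-off is to recompute the disparity map for several block sizes. The sketch below uses the OpenCV 3.x Java bindings, where the parameter is exposed through setBlockSize; the candidate sizes are arbitrary illustrative values:

```java
import org.opencv.calib3d.StereoBM;
import org.opencv.core.Mat;

public class ParameterSweep {
    // Recompute the disparity map for several block sizes to observe the
    // noise-versus-resolution trade-off described above
    static void sweepBlockSize(Mat left, Mat right) {
        StereoBM stereo = StereoBM.create(64, 5);
        for (int blockSize : new int[] {5, 9, 15, 21}) { // must be odd
            stereo.setBlockSize(blockSize);
            Mat disparity = new Mat();
            stereo.compute(left, right, disparity);
            // Larger blocks: fewer mismatches and less speckle, but
            // blurrier object boundaries
        }
    }
}
```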
Real-time Capture with Stereo Vision
When using a single camera to capture depth in real time, keep the camera's orientation constant, move it in a straight line (known as the baseline), and take pictures at small intervals. Doing so guarantees that the resulting images are suitable for depth analysis (see Figure 6).
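A minimal sketch of such a capture, assuming OpenCV's Java VideoCapture class and an arbitrary 500 ms pause while the camera slides along the baseline:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.videoio.VideoCapture;

public class BaselineCapture {
    public static void main(String[] args) throws InterruptedException {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        VideoCapture camera = new VideoCapture(0); // default USB camera
        Mat frame = new Mat();

        // Grab two frames a short interval apart while the camera moves
        // along its baseline with constant orientation
        camera.read(frame);
        Imgcodecs.imwrite("left.png", frame);
        Thread.sleep(500);
        camera.read(frame);
        Imgcodecs.imwrite("right.png", frame);

        camera.release();
    }
}
```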
However, to improve accuracy, two cameras should be combined into a rig and their known properties used to rectify the images for block matching.
Stereoscopic Images
Verifying the patterns encoded into stereoscopic images was the original goal of the Java application (see Figure 8). Although these depth patterns are visible to the naked eye, the application can generate the disparity map automatically, which serves as a proof of concept. It accomplishes this by attempting to match the left and right parts of a stereoscopic image: by varying the widths of the segments, a match can be found. A scrollbar is used to search across candidate widths, and the “Image Shift” label displays the current width of each section (see Figure 3).
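The application's own matching logic is not reproduced here, but a simple version of the idea can be sketched: compare the image with a copy of itself shifted by a candidate width w and pick the width with the lowest difference. The helper below is hypothetical and assumes a grayscale input:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Range;

public class PatternWidthSearch {
    // Score how well the image's pattern repeats with period `w`:
    // lower scores mean the shifted columns match better
    static double shiftScore(Mat gray, int w) {
        Mat a = gray.submat(new Range(0, gray.rows()), new Range(0, gray.cols() - w));
        Mat b = gray.submat(new Range(0, gray.rows()), new Range(w, gray.cols()));
        Mat diff = new Mat();
        Core.absdiff(a, b, diff);
        return Core.sumElems(diff).val[0]; // sum of absolute differences
    }

    // Try a range of candidate widths and return the best-matching one
    static int findPatternWidth(Mat gray, int minW, int maxW) {
        int best = minW;
        double bestScore = Double.MAX_VALUE;
        for (int w = minW; w <= maxW; w++) {
            double score = shiftScore(gray, w);
            if (score < bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }
}
```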
As an alternative to travelling along its baseline, the camera might be rotated around a scene. Many additional use cases could be evaluated and their depth accuracies compared. These tasks are not simple: accurate camera movement requires additional hardware and computer setup, as well as automatic submission of the images for analysis.
Further Research
Drawing a comparison with human binocular vision at the outset of this study was crucial, as it offers valuable insight into both the potential and the limitations of computer stereo vision. For instance, a computer finds matching points between two photos more easily when the two cameras project the images onto the same plane. Similarly, the position, orientation, and focus of the cameras significantly affect the accuracy of depth measurement.
For future research, it is proposed that computer vision may draw inspiration from the biological visual system. In the animal kingdom, for instance, binocular disparity can disrupt a hidden creature's camouflage; it would be helpful if computer vision could accomplish the same. It would also be intriguing to examine whether computer vision could learn to add depth to a single 2D image using object detection and recognition, since the human brain appears to connect depth values with object forms and silhouettes through some kind of neural network structure.
Conclusion
In computer vision, stereo vision is but one component of more intricate systems. Extracting depth information is a small part of transforming 2D photos into printed 3D models, especially in 3D object reconstruction; converting point clouds into mesh models and combining multi-view depth data for 3D model production are further stages. Even within the realm of stereo vision, several aspects must be considered: picture block similarity metrics, search optimization, essential and transformation matrices, and the camera calibration and image rectification needed to identify matching points in the two images.
A variety of computer vision libraries have abstracted away most of these intricacies, allowing users to focus on putting their ideas into practice rather than creating algorithms in a lab. OpenCV provides API methods for all the common tasks. The StereoBM class used in this project proved sufficient for picture depth analysis, while the StereoSGBM class and additional API functions serve more complex use cases.
Stereo vision is a powerful technique that allows machines to perceive depth and reconstruct 3D environments by mimicking human binocular vision. By leveraging concepts like disparity mapping and 3D reconstruction, stereo vision has become a cornerstone of modern computer vision applications. It enables advancements in robotics, autonomous systems, and AI-driven technologies, transforming how machines interact with the physical world.
References
[1] Howard, I. & Rogers, B. (2012). Perceiving in Depth. New York: Oxford University Press.
[2] Howard, I. & Rogers, B. (1995). Binocular Vision and Stereopsis. New York: Oxford University Press.
[3] Barry, S. (2009). Fixing My Gaze: A Scientist’s Journey into Seeing in Three Dimensions. New York: Basic Books.
[4] Armea, A. (2017). Calculating a depth map from a stereo camera with OpenCV. Retrieved September 13, 2018, from https://albertarmea.com/post/opencv-stereo-camera/
[5] Camera Calibration and 3D Reconstruction. OpenCV documentation. Retrieved September 2018, from https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html
[6] Miscellaneous Image Transformations. OpenCV documentation. Retrieved September 2018, from https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html
[7] Epipolar Geometry. OpenCV documentation. Retrieved September 2018, from https://docs.opencv.org/3.1.0/da/de9/tutorial_py_epipolar_geometry.html
[8] Depth Map from Stereo Images. OpenCV documentation. Retrieved September 2018, from https://docs.opencv.org/3.1.0/dd/d53/tutorial_py_depthmap.html
[9] Jet Propulsion Laboratory (n.d.). JPL Stereo Vision. Retrieved September 2018, from https://www-robotics.jpl.nasa.gov/facilities/facilityImage.cfm?Facility=13&Image=335