Toward a unified theoretical framework for photogrammetry

The objective of photogrammetry is to extract information from imagery. With the increasing interaction of sensing and computing technologies, the fundamentals of photogrammetry have undergone an evolutionary change in the past several decades. Numerous theoretical advances and practical applications have been reported from traditionally different but related disciplines, including computer vision, photogrammetry, computer graphics, pattern recognition, remote sensing and machine learning. This has gradually extended the boundary of traditional photogrammetry in both theory and practice. This paper introduces a new, holistic theoretical framework to describe various photogrammetric tasks and solutions. Under this framework, photogrammetry is regarded as a reversed imaging process formulated as a unified optimization problem. Depending on the variables to be determined through optimization, photogrammetric tasks are divided into image space tasks, image-object space tasks and object space tasks, each being a special case of the general formulation. This paper presents representative solution approaches for each task. With this effort, we intend to advocate an imminent and necessary paradigm change in both the research and learning of photogrammetry.


Introduction
Photogrammetry is the science and technology of extracting reliable three-dimensional geometric and thematic information, often over time, of objects and scenes from image and range data (Chen et al. 2016). Its objective is to extract various information from imagery. After over 150 years of development, photogrammetry has evolved into digital photogrammetry through the stages of plane-table, analog and analytical photogrammetry (Schenk 2005). Images of a scene or object of interest are the primary input data for photogrammetry. With the rapid advancement and interaction of sensing and computing technologies, the spatial resolution of cameras and the computing power of computers have increased tremendously. As such, more subject areas and applications in image-based science have emerged, such as image segmentation, simultaneous localization and mapping (SLAM), point cloud generation, mesh generation and 3D modeling, where the tasks of photogrammetry are not limited to, and extend beyond, generating the Digital Elevation Model (DEM), Digital Orthophoto Map (DOM), Digital Line Graph (DLG) or other traditional surveying products (Li, Zhu, and Li 2000b; Li 2000a).
Theories of traditional photogrammetry, focused on topographic applications, are facing many intrinsic limitations and challenges. The sensors we use have extended from metric cameras to uncalibrated, off-the-shelf ones, and from a single sensor to sensor systems (Shan 2017). The assumption of vertical photography and the resulting approximate analytics no longer hold, or need extension, for the ever more popular oblique and multi-view images. Our interest is extending from natural surfaces, which are usually continuous, to man-made objects, which are not. This change challenges many of the existing analytics in image matching, surface modeling and object reconstruction. In addition, multi-view imaging and indoor sensing no longer allow us to treat sensed targets as 2.5D (Real Ehrlich and Blankenbach 2019; Li and Fu 2018). Instead, the mathematics of multi-valued functions becomes necessary (Boyd and Vandenberghe 2004). Finally, the use of multi-view images also raises the need to consider radiometric variations of the object surfaces (Wu et al. 2011), which requires a framework that combines imaging geometry and imaging physics.
The imagery itself records both geometric and radiometric information of the object through a complicated imaging process. However, many applications in image-based science are concerned with either geometric or radiometric information alone. For example, multi-view geometry (Brown, Burschka, and Hager 2003; Furukawa and Ponce 2009; Hirschmüller 2008) reconstructs the 3D object relying on the corresponding relations among different images but often neglects the radiometric process of imaging, whereas shape from shading (Horn 1970; Zhang et al. 1999; Kemelmacher-Shlizerman and Basri 2011) recovers the 3D shape of the object according to the radiometric process of imaging without using multi-view geometry. The quality of both tasks might be improved if the two types of information were utilized in a complementary way. In addition, there is no systematic, in-depth understanding of the relations among these new and conventional applications. It is therefore necessary to build a unified theoretical framework for photogrammetry to make full use of the information recorded in images, to clarify the concepts, and to understand the intrinsic relations of these tasks. To this end, we represent the process of imaging with one mathematical model and formulate all photogrammetric tasks under this imaging model. According to the domain in which the tasks occur, the photogrammetric tasks are divided into image space tasks, image-object space tasks and object space tasks.
The paper is organized as follows. Section 2 describes the unified framework of photogrammetry. Sections 3-5 respectively explain the image space tasks, image-object space tasks and object space tasks. Section 6 summarizes the paper.

General expression of the imaging process
As shown in Figure 1, the physics of imaging involves five processes (Bakker et al. 2001;Peng, Zhang, and Shan 2015). The lighting process needs one or more light sources like the sun or skylight. Through the downward transmitting process, the photons from the light source transmit to the ground or object surface. The photons hit the surface and reflect. In the upward transmitting process, photons reflected from the surface transmit to the sensor through the atmosphere or other media. Finally, in the sensor response process, the camera translates the captured photons to image intensity with an analog-to-digital conversion.
Such an imaging process can be described by the following mathematical expression:

$$I(x, y) = h\Big(\tau_{up}\big(f(\tau_{down}(E, A_{down}), \rho, X, \theta_i, \varphi_i, \theta_r, \varphi_r), A_{up}\big), \gamma, c\Big) \qquad (1)$$

where
• $I(x, y)$ is the image intensity at the image point $(x, y)$;
• $E$ is the energy of the light source;
• $\rho$ is the surface albedo;
• $X$ is the coordinates of the object, which can represent the surface shape;
• $A_{up}$ and $A_{down}$ are the upwelling and downwelling atmosphere;
• $\theta_i$, $\varphi_i$, $\theta_r$ and $\varphi_r$ are the incident zenith, incident azimuth, reflection zenith and reflection azimuth, respectively;
• $\tau_{up}$ and $\tau_{down}$ are, respectively, the upwelling and downwelling transmitting processes through the atmospheres $A_{up}$ and $A_{down}$;
• $f$ is the reflection function, which describes the reflection process in terms of the albedo $\rho$, the surface shape (represented by the surface coordinates $X$), and the incident and reflection angles. The albedo depends on the material of the observed object, and the reflection angles are related to the camera setting. Usually, the reflection process is modeled by the Bidirectional Reflectance Distribution Function (BRDF);
• $h$ is the camera response function with the factor $\gamma$ and the camera parameter set $c$, which includes the interior and exterior orientation parameters.
Note that we do not explicitly distinguish vectors from scalars, as this is usually clear from the context. When dealing with close-range, oblique or aerial images, the influence of the atmospheres $A_{up}$ and $A_{down}$ is often omitted, and the incident zenith $\theta_i$, incident azimuth $\varphi_i$, reflection zenith $\theta_r$ and reflection azimuth $\varphi_r$ can be simplified to an incident angle and a reflection angle. The imaging process can then be simplified to the equation below:

$$I(x, y) = h\big(f(L, \rho, X, \theta_i, \theta_r), c\big) \qquad (2)$$

Consequently, the simplified imaging process contains the illumination $L$, the albedo of the surface $\rho$, the coordinates of the object $X$, the incident angle $\theta_i$, the reflection angle $\theta_r$ and the camera parameters $c$.
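As a toy illustration of the simplified imaging process of Equation (2), the sketch below renders the intensity of a single surface patch, assuming a Lambertian reflection function $f$ and a linear camera response $h$; the function names and numeric values are illustrative assumptions, not part of the paper:

```python
import numpy as np

def reflection_lambertian(L, albedo, normal, light_dir):
    """Lambertian reflection f: radiance = L * rho * cos(theta_i),
    where cos(theta_i) is the dot product of the unit normal and
    unit light direction, clamped to be non-negative."""
    cos_ti = max(np.dot(normal, light_dir), 0.0)
    return L * albedo * cos_ti

def camera_response(radiance, gain=1.0, bias=0.0):
    """A linear stand-in for the camera response h(., gamma, c)."""
    return np.clip(gain * radiance + bias, 0.0, 1.0)

# a horizontal surface patch lit from an oblique direction
n = np.array([0.0, 0.0, 1.0])                           # surface normal
l = np.array([0.0, 0.6, 0.8]); l /= np.linalg.norm(l)   # light direction
I = camera_response(reflection_lambertian(L=1.0, albedo=0.5, normal=n, light_dir=l))
print(round(float(I), 3))  # 0.4
```

Inverting this forward model, i.e. recovering $\rho$, $X$ or $c$ from observed intensities, is exactly the reversed imaging process discussed below.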

Photogrammetry as reverse imaging
As mentioned above, the imaging process records the geometric and radiometric information of the interested object. The task of traditional photogrammetry is to recover the geometric information from the images, which is actually the reverse of this imaging process. For example, recovering the 3D position of an object is a typical reversed imaging process in which the unknown position of the object is estimated from the observed images with a standard procedure, including tie point matching, bundle adjustment, dense image matching and forward intersection (Wang 1990; Li, Wang, and Zhou 2008). Bundle adjustment estimates the camera parameters, including the interior orientation parameters (focal length, principal point position and camera distortion) and exterior orientation parameters (camera locations and orientations) (Li and Shan 1989; Beardsley, Tort, and Zisserman 1996; Agarwal et al. 2010; Wu 2016; Shan 2018), whereas the corresponding 3D coordinates of the object are determined through image matching (Gruen and Baltsavias 1987; Gruen 2012; Zhao, Yan, and Zhang 2018) and forward intersection (Wang 1990; Li, Wang, and Zhou 2008). The basic assumption for this to be possible is that the corresponding image points in stereo or multi-view images of an object are uniquely and well defined. However, this may not always be the case, since the illumination condition and the reflection angle (the view angle) differ between images. Generally, considering two or more images, the task is to find the corresponding image points in different images by minimizing the photometric difference. The general task of photogrammetry can be expressed as optimizing the following objective function:

$$\min_{\theta} E(\theta) = E_{data}(\theta) + E_{smooth}(\theta) \qquad (3)$$

where $\theta = \{L, \rho, X, \theta_i, \theta_r, c\}$ is the parameter set consisting of the illumination $L$, the albedo of the surface $\rho$, the coordinates of the object $X$, the incident angle $\theta_i$, the reflection angle $\theta_r$ and the camera parameters $c$.
The data term $E_{data}$, in potential energy form, records the difference of the corresponding image points of a 3D point in the image matching task, or the difference between the observed image intensity and the rendered intensity in the Shape from Shading (SfS) task. $E_{smooth}$ is the smoothness term or penalty term, which represents the smoothness constraints between adjacent image points, such as the disparity smoothness in dense matching and the surface albedo smoothness in shape from shading. Usually, $E_{smooth}$ is a combination of multiple constraints rather than one single constraint. Both $E_{data}$ and $E_{smooth}$ are functions of the parameter set $\theta$. Depending on which subset of parameters in $\theta$ we are trying to determine, photogrammetry can be divided into several different tasks.

Photogrammetric tasks and their relations
The tasks in photogrammetry depend on which variables in Equation (2) are set as known to a certain level of confidence while the others are to be determined. For tasks in image space, the image $I$ is often the only observation. When an image $I$ is to be divided into several parts, the task is image segmentation. When the correspondences (described by a disparity map $M$) of two or more images are to be determined, the task is image co-registration or image matching. For the image-object space tasks, the imagery is still the observation, whereas sparse or dense point coordinates of the object, such as control points or a DEM, may be known. When the 3D structure $X$ of the object and the camera parameters $c$ are to be determined, the task is Structure from Motion or aerial triangulation. When the map of the environment and the location of the camera are to be determined, the task is SLAM. When the 3D coordinates $X$ of the object of interest are to be updated, the task is surface refinement based on shape from shading. For tasks in object space, a point cloud is often the observation, which can be obtained through image matching or laser scanning. Such tasks include surface triangulation, DEM interpolation, point cloud segmentation and 3D reconstruction.
In summary, each task in photogrammetry can be considered as a reversed imaging process while different variables to be determined lead to different tasks.

Image segmentation
Image segmentation aims to partition a given image into a set of meaningful regions or objects (Bar et al. 2015;Zhang, Fritts, and Goldman 2008;Mumford and Shah 1989). It is an important and fundamental problem in photogrammetry and computer vision with many real-world applications, such as autonomous driving, virtual reality and 3D reconstruction.
The camera records the reflection intensities of objects as gray values of the image through a series of physical imaging processes (Equation 2). Because of differences in albedo, different objects often show different gray values in the image, while pixels belonging to the same object tend to show consistent gray values (Klinker, Shafer, and Kanade 1988; Mumford and Shah 1989). Thus, we can segment the image into meaningful regions or objects according to the differences in their gray values.
Image segmentation can be divided into unsupervised and supervised methods. An unsupervised method aims to partition the image $I$ into a set of regions $\{\Omega_i\}_{i=1}^{N}$ such that $I$ is homogeneous within each region and varies rapidly across different regions (Mumford and Shah 1989). Let $\kappa: I_p \in I \rightarrow r_p$ be a mapping function which maps each image pixel to a region index. The energy function for image segmentation can be formulated as

$$\min_{\kappa} E(\kappa) = \sum_{p} D_s\big(\kappa(I_p), I_p\big) + \sum_{(p, q) \in \mathcal{N}} R_s(I_p, I_q)\,\delta\big(\kappa(I_p) \neq \kappa(I_q)\big) \qquad (4)$$

where $\mathcal{N}$ is the neighborhood set and $\delta(\cdot)$ represents the indicator function (Salah, Mitiche, and Ayed 2010; Vu and Manjunath 2008). $D_s(\kappa(I_p), I_p)$ is the data term measuring the inconsistency between the assigned region indices and the image pixels. $R_s(I_p, I_q)$ is the regularization term penalizing differences between neighboring pixels of different regions. The mapping function $\kappa$ minimizing the energy function of Equation (4) is the segmentation result of image $I$, which assigns each pixel a unique region index.
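To make the energy of Equation (4) concrete, the sketch below minimizes a Potts-style instance of it by iterated conditional modes (ICM), one of several possible optimizers; the squared-difference data term, the fixed region means and the toy image are all illustrative assumptions:

```python
import numpy as np

def icm_segment(img, means, lam=0.5, iters=5):
    """Minimize a Potts-style segmentation energy: data term (I_p - mu_r)^2
    plus a penalty lam for every disagreeing 4-neighbor pair, by iterated
    conditional modes (greedy per-pixel relabeling)."""
    H, W = img.shape
    # initialize with the data term alone (nearest region mean)
    labels = np.argmin([(img - m) ** 2 for m in means], axis=0)
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                costs = []
                for r, m in enumerate(means):
                    data = (img[y, x] - m) ** 2
                    smooth = sum(lam for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                                 if 0 <= y + dy < H and 0 <= x + dx < W
                                 and labels[y + dy, x + dx] != r)
                    costs.append(data + smooth)
                labels[y, x] = int(np.argmin(costs))
    return labels

# two-region toy image with one noisy pixel
img = np.zeros((8, 8)); img[:, 4:] = 1.0
img[3, 2] = 0.9                       # noisy pixel the smoothness term should fix
seg = icm_segment(img, means=[0.0, 1.0])
print(seg[3, 2])  # 0 -> absorbed into the dark region
```

With the data term alone the noisy pixel would be labeled 1; the smoothness term of Equation (4) overrules it, illustrating why the regularizer matters.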
Recently, with the increasing availability of public datasets, supervised segmentation methods (e.g. machine/deep learning) have achieved remarkable results (Shelhamer, Long, and Darrell 2017; Badrinarayanan, Kendall, and Cipolla 2017; Chen et al. 2018; Yang et al. 2018). Compared with unsupervised methods, a supervised segmentation method aims to construct a prediction model that assigns each pixel a class label instead of a region index. Denote $\mathcal{L}$ as the set of ground truth labels, and let $\upsilon: I_p \in I \rightarrow l_p \in \mathcal{L}$ be the prediction model assigning each pixel a class label. The energy function for supervised image segmentation can be similarly modified as

$$\min_{\upsilon} E(\upsilon) = \sum_{p} D_s\big(\upsilon(I_p), l_p^{(t)}\big) \qquad (5)$$

where $D_s$ measures the inconsistency between the predicted pixel label $\upsilon(I_p)$ and the corresponding ground truth label $l_p^{(t)}$. The energy function of Equation (5) is usually optimized with stochastic gradient descent (SGD) (Kiefer and Wolfowitz 1952), Adagrad (Duchi, Hazan, and Singer 2011), Adam (Kingma and Ba 2015) or other methods.
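As a minimal stand-in for the prediction model $\upsilon$, the sketch below trains a per-pixel logistic classifier on intensity alone with plain SGD on a cross-entropy data term; real systems use deep networks and richer features, and the synthetic training data here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy training set: pixel intensities with ground-truth class labels
x = np.concatenate([rng.normal(0.2, 0.05, 200), rng.normal(0.8, 0.05, 200)])
t = np.concatenate([np.zeros(200), np.ones(200)])

w, b, lr = 0.0, 0.0, 0.5
for epoch in range(200):                 # plain SGD on the data term of Eq. (5)
    for i in rng.permutation(len(x)):
        p = 1.0 / (1.0 + np.exp(-(w * x[i] + b)))   # sigmoid prediction
        g = p - t[i]                                # cross-entropy gradient
        w -= lr * g * x[i]
        b -= lr * g

pred = (1.0 / (1.0 + np.exp(-(w * x + b))) > 0.5).astype(int)
print((pred == t).mean())  # essentially 1.0 on this separable toy set
```

The loop is the stochastic counterpart of minimizing Equation (5): each update descends the data term for one randomly chosen pixel.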
In Figure 2 we visually provide some segmentation results on the ISPRS airborne image dataset of Vaihingen (Gerke 2014). The unsupervised segmentation results are computed using the Potts model (Storath et al. 2015), while the supervised segmentation results are predicted using PSPNet (Zhao et al. 2017). The unsupervised method tends to partition the image into piecewise segments with consistent gray values, while the supervised method aims to segment the image into semantically meaningful objects or classes with the help of the prior label information.

Image matching
Image matching aims to find corresponding points between two or more images. It is the prerequisite for automated bundle adjustment and 3D reconstruction from multi-view images. For decades, image matching has attracted considerable attention in both the photogrammetry community (Gruen 2012) and the computer vision community (Scharstein and Szeliski 2002).
Image matching can be divided into two categories: sparse matching and dense matching. Sparse matching aims to match corresponding points for bundle adjustment, image registration or other applications that need well-distributed corresponding points. As such, sparse matching algorithms only need to match key points with distinctive appearance between images. Generally, image feature points are first extracted using interest operators, such as Harris (Harris and Stephens 1988), Förstner (Förstner and Gülch 1987), DoG, SIFT (Lowe 2004), Daisy (Tola, Lepetit, and Fua 2010), or their variations. After that, correspondences are established by matching algorithms that minimize certain similarity costs.
Different from sparse matching, dense matching is used for 3D reconstruction, which recovers the fine 3D structures of objects from multi-view images. In order to completely reconstruct the 3D shapes and textures of objects, matching is carried out as dense as possible.
With the camera parameters $c$ being known, for two images $I_1(x, y)$ and $I_2(x, y)$, their epipolar images $I_{e1}(x, y)$ and $I_{e2}(x, y)$ are generated. The general form of the dense matching problem is

$$\min_{M} E\big(M; I_1(x, y), I_2(x, y), c_1, c_2\big) \qquad (6)$$

Currently, pixel-wise matchers are the mainstream dense matching methods. The problem of finding corresponding points between images is transformed into computing the pixel-wise disparity map between the left and right epipolar images. Specifically, the objective function of pixel-wise matching can be expressed as

$$\min_{M} E(M) = \sum_{p} D_m(p, M_p) + \sum_{p} \sum_{q \in N_p} R_m(M_p, M_q) \qquad (7)$$

where $p$ is a base image pixel in the left epipolar image $I_{e1}(x, y)$, $q$ is a pixel in the neighborhood $N_p$ of $p$, and $M_p$ and $M_q$ are the disparity values of pixels $p$ and $q$, respectively. For a pixel $p = [x, y]^T$ with disparity $M_p$, its corresponding pixel in the right epipolar image $I_{e2}(x, y)$ is $p' = [x - M_p, y]^T$. The first term of Equation (7) is the sum of matching costs, while the second is the smoothness term used to constrain the continuity of the disparities of adjacent pixels; $D_m(p, M_p)$ and $R_m(M_p, M_q)$ are the matching cost function and penalty function, respectively.
There are many kinds of matching cost functions. One of the simplest is the Sum of Absolute Differences (SAD):

$$D_m(p, M_p) = \big| I_{1p} - I_{2p'} \big| \qquad (8)$$

where $I_{1p}$ is the pixel value of pixel $p$ in the left image and $I_{2p'}$ is the pixel value of its corresponding pixel $p'$ in the right image; in practice the absolute differences are summed over a small window centered at $p$, hence the name. SAD assumes that the lighting conditions of different images are constant and that the reflections with different zeniths and azimuths for the same location of the object are the same, which means that the reflection function $f$ in Equations (1) and (2) is independent of the variables $\theta_i$, $\varphi_i$, $\theta_r$ and $\varphi_r$. This, in turn, means the corresponding points in different images have approximately equal gray values. However, the actual situation is usually not ideal: the lighting conditions vary with time, and the reflections vary with zenith and azimuth, reducing the effectiveness and robustness of SAD. More robust matching cost functions, such as Normalized Cross-Correlation (NCC) (Zhang and Gruen 2006), Mutual Information (MI) (Hirschmüller 2008) and census (Zabih and Woodfill 1994), are therefore used in practice. Typical algorithms for solving Equation (7) include Winner-Take-All (WTA), dynamic programming, scanline optimization, simulated annealing and graph cuts (Scharstein and Szeliski 2002). Among them, WTA is a local solution that ignores the smoothness term, while the others are global solutions that consider the mutual constraints between the matching results of adjacent pixels. Figure 3 shows an example of dense matching of two aerial images of the Wuhan University campus. Figure 3(a,c) shows the two input images $I_1(x, y)$ and $I_2(x, y)$, whose epipolar images $I_{e1}(x, y)$ and $I_{e2}(x, y)$ are shown in Figure 3(b,d), respectively. Dense matching is performed using $I_{e1}(x, y)$ and $I_{e2}(x, y)$ as input, and the matching result is the pixel-wise disparity map of $I_{e1}(x, y)$, as shown in Figure 3(e), in which the disparity values are rendered in different colors.
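A minimal Winner-Take-All solver for the dense matching objective with a windowed SAD cost, dropping the smoothness term entirely; the window size, disparity range and toy image pair are illustrative assumptions:

```python
import numpy as np

def wta_sad_disparity(left, right, max_disp, win=1):
    """Winner-Take-All dense matching with a SAD cost: for each pixel of the
    left epipolar image, pick the disparity whose window SAD against the
    right epipolar image is lowest (no smoothness term)."""
    H, W = left.shape
    pad = win
    L = np.pad(left, pad, mode='edge')
    R = np.pad(right, pad, mode='edge')
    disp = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            best, best_d = np.inf, 0
            for d in range(min(max_disp, x) + 1):   # p' = [x - d, y]
                lw = L[y:y + 2 * pad + 1, x:x + 2 * pad + 1]
                rw = R[y:y + 2 * pad + 1, x - d:x - d + 2 * pad + 1]
                cost = np.abs(lw - rw).sum()        # SAD over the window
                if cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp

# toy epipolar pair: a bright block shifted by 2 px between the images
right = np.zeros((6, 10)); right[2:4, 3:6] = 1.0
left = np.zeros((6, 10));  left[2:4, 5:8] = 1.0    # same block, shifted +2
disp = wta_sad_disparity(left, right, max_disp=4)
print(disp[2, 6])  # 2
```

Adding the smoothness term of Equation (7) on top of these costs turns this local WTA scheme into the global formulations solved by dynamic programming or graph cuts.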

Bundle adjustment
Bundle adjustment is an important step of Structure from Motion (SfM) (Zhang et al. 1999;Triggs et al. 1999). The classical SfM algorithm contains feature matching and bundle adjustment. Feature matching provides tie points between images for bundle adjustment, which simultaneously estimates the camera parameters and the corresponding object coordinates of the tie points through optimization.
Bundle adjustment is the procedure of refining a visual reconstruction to produce jointly optimal 3D structure and camera parameters (Triggs et al. 1999). It is based on the collinearity condition, i.e. the image point, object point and projection center all lie along a straight line, which is the geometric process of imaging. Theoretically, the image points of the same object point in different images should be consistent. However, due to image noise, image distortion and feature matching errors, there is a difference between the measured image point and the image point projected from the estimated object point, i.e. the re-projection error. The re-projection equation is

$$x = (x_0, y_0) = P(c, X) \qquad (9)$$

where $x = (x_0, y_0)$ is the back-projected image point, $c$ is the camera parameters, $X$ is the object point coordinates, and $P$ is the projection function.
Considering a dataset with $m$ images and $n$ tie points generated by image matching, the 3D structure and camera parameters can be estimated by minimizing the re-projection error:

$$\min_{c_j, X_i} E(c, X, x, I) = \min_{c_j, X_i} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\| x_{i,j} - P(c_j, X_i) \big\|^2 \qquad (10)$$

where $X_i$ is the object coordinates of the $i$-th tie point, $c_j$ is the camera parameters of the $j$-th image, and $x_{i,j}$ is the image coordinates of $X_i$ in the $j$-th image.
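A toy instance of Equation (10): with the structure known, the rotation fixed to identity and only the camera position unknown, a few Gauss-Newton steps on the re-projection residuals recover the camera. This is a didactic sketch, not a production bundle adjuster; the projection model, point coordinates and function names are assumptions:

```python
import numpy as np

def project(t, X, f=1.0):
    """Pinhole projection P(c, X) with identity rotation and focal length f:
    the camera sits at t and images the points X."""
    d = X - t
    return f * d[:, :2] / d[:, 2:3]

def gauss_newton_resection(obs, X, t0, iters=10):
    """Minimize the re-projection error of Eq. (10) over the camera position
    only, via Gauss-Newton with a forward-difference Jacobian."""
    t = t0.astype(float)
    for _ in range(iters):
        r = (obs - project(t, X)).ravel()          # re-projection residuals
        J = np.zeros((r.size, 3))
        eps = 1e-6
        for k in range(3):
            dt = np.zeros(3); dt[k] = eps
            J[:, k] = ((obs - project(t + dt, X)).ravel() - r) / eps
        t -= np.linalg.solve(J.T @ J, J.T @ r)     # normal-equation step
    return t

X = np.array([[0., 0., 5.], [1., 2., 6.], [-2., 1., 7.], [3., -1., 8.]])
t_true = np.array([0.3, -0.2, 0.1])
obs = project(t_true, X)                           # noise-free observations
t_est = gauss_newton_resection(obs, X, t0=np.zeros(3))
print(np.allclose(t_est, t_true, atol=1e-4))  # True
```

A real bundle adjustment solves the same least-squares problem jointly over all $c_j$ and $X_i$, exploiting the sparsity of the Jacobian.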
Since the re-projection function is not linear, bundle adjustment is a non-linear optimization process. The problem is complex because the number of parameters can be huge, and optimizing a large-scale non-linear function is not easy. Gradient Descent, Gauss-Newton and Levenberg-Marquardt, or their variations, are often used to solve the problem (Agarwal et al. 2010). These methods can achieve correct results when the initial guess of the camera parameters and 3D structure is good; with a bad initial guess, however, they may converge to a local optimum.
There are several software products for bundle adjustment and SfM, such as VisualSFM (Wu 2016), COLMAP (Schonberger and Frahm 2016) and Agisoft MetaShape (https://www.agisoft.com). In this paper, a group of aerial images of the Wuhan University campus is used as an example of bundle adjustment. As shown in Figure 4, MetaShape successfully recovers the positions and orientations of the cameras and the 3D structure of the buildings.

SLAM
Simultaneous localization and mapping (SLAM) aims to build a map of the environment while simultaneously determining the location of a robot or a moving device. According to the sensors being used, SLAM can be categorized into three main types: laser-based, sonar-based and vision-based (i.e. image-based) (Aulinas et al. 2008). Considering that vision-based SLAM (vSLAM) (Fuentes-Pacheco, Ruiz-Ascencio, and Rendón-Mancha 2015) is more popular, less expensive and more related to the imaging process, this paper is limited to this subject only. Essentially, the problems addressed by vSLAM and SfM are the same, i.e. estimating the camera parameters (called localization in vSLAM) and the 3D structure (called mapping in vSLAM) from a sequence of images. A classical vSLAM system normally includes visual odometry, optimization, loop closing and mapping. The ultimate goal of vSLAM is real-time mapping and navigation. As such, all calculations, including route prediction, need to be done on-the-fly. In addition, as the images in vSLAM are collected incrementally, it is necessary to avoid or reduce the impact of drift, e.g. by Kalman filtering or optimization (Bresson et al. 2017). The mathematical details of vSLAM are similar to those of bundle adjustment as described in Section 4.1.
An example of vSLAM is shown in Figure 5 (Ji et al. 2020). The images were collected with a wide-baseline vehicle-mounted Ladybug camera which contains six fisheye lenses, five of which evenly cover a 360° horizontal view, while one fisheye lens points upward. Figure 5(a) shows examples of the input fisheye images. The whole panoramic SLAM system is based on the fisheye calibration model, the panoramic model and bundle adjustment (Ji et al. 2020), and includes several typical vSLAM steps: initialization, tracking, key frame selection, local mapping, loop closure detection and global optimization (Mur-Artal and Tardos 2017). The recovered trajectory is shown in Figure 5(b) in comparison to the ground truth recorded with a highly accurate Applanix POS/AV navigation system (about 0.1 m localization accuracy).

Shape from shading
Shape from shading aims to recover shape from a single image or multiple images (Horn and Brooks 1986). From the imaging process, it is known that the image intensity is the interaction of illumination, surface albedo and surface normal. Since the surface normal is the derivative of the surface shape, the shape can be recovered by solving for the surface normal from the imaging model. Theoretically, it is a process of minimizing the difference between the image intensity and the inverse rendering. However, besides the surface normal, the illumination and surface albedo are usually unknown, which makes recovering the surface normal from a single image an ill-posed problem. To make the problem solvable, the illumination and surface albedo are often assumed to be known or constant (Zhang et al. 1999). Another kind of method that relaxes these assumptions is Photometric Stereo (PS) (Woodham 1980), which demonstrated that the surface normal can be recovered from multiple images captured from a fixed viewpoint under different illuminations. The problem is often solved with variational methods (Horn and Brooks 1986).
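The photometric stereo idea can be sketched in a few lines: with Lambertian reflectance and at least three known light directions, the image intensities are linear in the albedo-scaled normal, which is then recovered by least squares. The light directions and surface values below are illustrative assumptions:

```python
import numpy as np

# Photometric stereo (Woodham 1980) for one surface point: with Lambertian
# reflectance and k known unit light directions L, the intensities are
# I = L @ (rho * n), so the scaled normal g = rho * n follows from a
# linear least-squares solve -- a minimal, noise-free sketch.
n_true = np.array([0.2, -0.1, 0.97]); n_true /= np.linalg.norm(n_true)
rho_true = 0.6
L = np.array([[0., 0., 1.], [0.8, 0., 0.6], [0., 0.8, 0.6], [-0.5, 0.5, 0.707]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)
I = L @ (rho_true * n_true)                 # rendered intensities, no shadows

g, *_ = np.linalg.lstsq(L, I, rcond=None)   # g = rho * n
rho = np.linalg.norm(g)                     # albedo = length of g
n = g / rho                                 # unit normal = direction of g
print(np.allclose(n, n_true), round(float(rho), 3))  # True 0.6
```

Integrating the per-pixel normals into a height field then yields the surface shape.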
In recent years, shape from shading has also been used to refine a surface known a priori to a certain extent, especially in combination with multi-view stereo (MVS) (Kim, Torii, and Okutomi 2016; Wu et al. 2011). MVS can be used to generate a coarse surface, which is then refined by shape from shading. For texture-less or texture-weak areas that are hard to handle with MVS, shape from shading can recover the fine shape. When refining with shape from shading, the coarse surface provides an initial surface normal, while the illumination and surface albedo can be estimated first. The surface normal can then be further refined with proper constraints. Iterating these two steps makes the final surface shape well determined. The problem is often modeled as an energy function to be minimized through gradient descent or non-linear least squares:

$$\min_{\nabla X, \rho, L} E(\nabla X, \rho, L, c, I, X) = \min \; E_{data} + \alpha E_{s1} + \beta E_{s2} \qquad (11)$$

where $\nabla X$ is the vertex displacement along its normal, $\rho$ is the albedo of a vertex, $L$ is the illumination, and $\alpha$ and $\beta$ balance the data term $E_{data}$, the geometry smoothness term $E_{s1}$ and the reflectance smoothness term $E_{s2}$. As shown above, shape from shading is also a reverse problem of the imaging process. When refining a coarse surface, even though the camera parameters $c$ are known and the illumination $L$ and albedo $\rho$ in Equation (2) can be estimated with the coarse surface, there are still three unknown variables: the fine surface $X$, the incident angle $\theta_i$ and the reflection angle $\theta_r$. As such, the Lambertian reflectance model, which reflects equally in all directions, is often assumed to eliminate the reflection angle $\theta_r$. The incident angle $\theta_i$ can be estimated from the known imaging time and geographic coordinates for natural terrain. When dealing with other objects such as buildings, spherical harmonics (Basri and Jacobs 2001) are introduced, which use several variables and the surface normal to represent the imaging process of the scene.
Based on the method of Peng, Zhang, and Shan (2015), in which the 90 m SRTM is refined to 30 m with Landsat 8 imagery, a GaoFen-1 image is used here to refine the SRTM from a resolution of 90 m to 15 m. The original resolution of the GaoFen-1 image is 16 m and the resolution of the SRTM is 90 m. First, both the image and the DEM are resampled to 15 m. Then, the shading-based DEM refinement algorithm is applied, refining the DEM to 15 m. As shown in Figure 6, the refined DEM has higher resolution and more geospatial detail compared to the original DEM.

Object space tasks
Tasks of or related to photogrammetry may also occur in object space, including point cloud registration (Besl and McKay 1992; Serafin and Grisetti 2015), point cloud segmentation (Golovinskiy and Funkhouser 2009; Papon et al. 2013), mesh generation (Shewchuk 2002) and object reconstruction (Yan, Shan, and Jiang 2014; Xiong et al. 2015). The input for these tasks, such as a point cloud, can be generated through Light Detection And Ranging (LiDAR) or stereo matching. Compared to LiDAR, the object information generated from stereo matching contains not only the geometric information but also the radiometric information in the images. Therefore, the information and methods of the image space can also be considered for the object space tasks, e.g. using the color information in point cloud segmentation or using shape from shading to refine the surface.
Among the tasks occurring in object space, a segmentation example of a point cloud derived from image matching is shown in Figure 7. It is worthwhile to note that point cloud segmentation is still an optimization problem and can be formulated with the optimization framework as follows:

$$\min_{R} E(R) = \sum_{p \in P} D_p(R_p) + \sum_{p \in P} \sum_{q \in N_p} w_{pq}\,\delta(R_p \neq R_q) + \sum_{r \in L'} h_r \qquad (12)$$

where $P$ represents the set of data points whose coordinates are represented by $X$ in Equation (2), $R$ is the labeling of the points, $N_p$ is an assumed neighborhood of the data point $p$, $L$ is a given set of labels (e.g. planes), $L'$ is the set of labels appearing in $R$, $D_p(\cdot)$ is a function measuring the difference between data points and labels, $w_{pq}$ is the weight of the smooth term, $h_r$ is the weight of label $r$ and $\delta(\cdot)$ is an indicator function. The first term is the data term, which measures the discrepancy between data points and labels. The second term is the smooth term, which measures the label inconsistency between neighboring points. The third term is the label term, which penalizes the number of labels appearing in $R$.
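The full energy above is typically minimized with move-making methods such as α-expansion. As a minimal sketch of just the data-term side, the code below extracts one planar segment from a synthetic point cloud by RANSAC, where the point-to-plane distance plays the role of $D_p$; the thresholds and the toy scene are assumptions for illustration:

```python
import numpy as np

def ransac_plane(pts, thresh=0.05, iters=200, rng=np.random.default_rng(1)):
    """Fit one plane label to a point set: repeatedly sample 3 points,
    form the plane through them, and keep the plane with the most inliers.
    The point-to-plane distance is the data term D_p."""
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:         # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = np.abs((pts - p0) @ n)           # point-to-plane distances
        inliers = d < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

rng = np.random.default_rng(0)
# synthetic scene: a noisy ground plane (z ~ 0) and a smaller vertical wall
ground = np.c_[rng.uniform(0, 10, (400, 2)), rng.normal(0, 0.01, 400)]
wall = np.c_[rng.uniform(0, 10, 200), np.full(200, 5.0), rng.uniform(0, 3, 200)]
pts = np.vstack([ground, wall])

labels = np.full(len(pts), -1)
labels[ransac_plane(pts)] = 0                # largest plane -> the ground
print((labels[:400] == 0).sum() > 300)  # True: most ground points labeled
```

Repeating the extraction on the remaining points yields further plane labels; the smoothness and label terms of the full energy then clean up the boundaries and merge redundant planes.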

Concluding remarks
This paper expresses the imaging process under one combined physical and mathematical model. Photogrammetry is regarded as a reverse problem of the imaging process and formulated with one general optimization framework. Based on this unified framework, different tasks in photogrammetry are reduced to special cases of this framework, and for each such task we provide one representative solution and example. In summary, all photogrammetric tasks are essentially reversed imaging processes and can be formulated under one framework where the different variables to be determined lead to different tasks.
It should be noted that the division of tasks into image space, image-object space and object space reflects the major domain in which the problem is being solved. However, it does not mean that no other space or information is involved at all. In fact, photogrammetric tasks are often resolved in more than one domain if relevant and reliable information is available. It is not uncommon to see solutions that work across multiple spaces, often with one being the major one.
Finally, we would like to restate the motivation of this work. The traditional theoretical framework for photogrammetry that has been used for decades cannot fully reflect current practices or meet the needs of a much broader community that uses photogrammetric techniques. The discipline needs an imminent and necessary paradigm reshaping with rigorous and solid fundamentals. To this end, this paper presented a systematic and comprehensive structure with demonstrated examples for both research and learning.

Notes on contributors
Jie Shan is a Professor with the Lyles School of Civil Engineering, Purdue University, USA. His research interests include object extraction and reconstruction from images and point clouds, urban remote sensing, and data mining of spatial, temporal, and semantic data.
Zhihua Hu is working toward a PhD degree in photogrammetry and remote sensing. His current research interests include mesh refinement and 3D reconstruction from multi-view images.
Pengjie Tao is currently an associate research fellow. His research interests include photogrammetry, registration of optical images and LiDAR points, and 3D reconstruction from multi-view images.
Lei Wang is working toward a PhD degree in photogrammetry and remote sensing. His research interests include 3D computer vision, machine/deep learning, and 3D understanding.
Shenman Zhang is pursuing a PhD degree in photogrammetry and remote sensing. His current research interests include point cloud registration and building reconstruction.
Shunping Ji is a Professor with the School of Remote Sensing and Information Engineering, Wuhan University. His research interests include photogrammetry, remote sensing image processing, mobile mapping system, and machine learning.