Accuracy and utility of the Structure Sensor for collecting 3D indoor information

Abstract This paper presents the results of an investigation into the utility of the Structure Sensor, developed by Occipital Inc., and the accuracy of its output for 3D surveying of building interiors in relation to the Surveying (Cadastral Surveys) Regulations 2005 in Victoria, Australia. The paper investigates data acquisition issues, defines guidelines for obtaining the best reconstruction result, and evaluates the result against the requirements set by the Regulations. The findings are mixed. The sensor delivers more accurate outputs for smaller rooms. The accuracy does not meet the requirements, although it was found to be close to what the Regulations expect. Finally, the paper argues that the device is user-friendly enough to be used by non-experts for crowdsourcing indoor information, and that the accuracy of its output can meet the needs of other domains such as indoor navigation and public safety.


Introduction
Human-scale three-dimensional (3D) understanding of indoor environments is becoming increasingly important in many disciplines. With 3D indoor models, mechanical, electrical, and plumbing facilities can be better positioned in relation to the building. Law enforcement, public safety, and first responders can benefit from 3D models by acquiring total situational awareness of an indoor environment. Finally, as-built 3D models of buildings could assist owners and managers in the maintenance of real estate properties.
In recent years, there has been increased interest in using point cloud information in land development to visualize the status and control the quality of a project at different stages. As Bosche, Haas, and Akinci (2009) observed, project performance control tasks often require 3D as-designed and as-built information at object level to map the differences that are generated during the construction process. Past research projects have already proven the value of 3D modeling for applications such as progress control and measurement quality assessment of structural works (Kim, Son, and Kim 2013; Tang et al. 2011; Turkan et al. 2012).
In this context, Rosser, Morley, and Jackson (2012) argue that the capture and management of building interior data differ from those of outdoor environments, with a specific set of technical requirements and social and legal challenges. From a technical perspective, heavy-duty laser scanners are usually used for 3D indoor reconstruction. The high hardware and software cost of laser scanning makes it infeasible for small building infrastructure projects (Zhu and Brilakis 2009), where project management is focused on cost cutting.
Many affordable sensing-based technologies are now available to professionals to support building modeling and indoor surveying. The ease of use of these consumer-grade technologies also makes it possible for non-experts to map and model indoor spaces. These devices bring an opportunity to fill the indoor information gap that we face worldwide. Just as smartphones brought about the volunteered geographic information phenomenon, these widespread technologies facilitate the realization of volunteered 3D indoor information.
The most recent development in these devices is the mobile Structure Sensor (Occipital 2014), developed by Occipital Inc. in collaboration with PrimeSense in 2013. This small, lightweight, and wireless depth sensor collects and instantly registers point cloud data, making the 3D reconstruction of indoor spaces more affordable. For this sensor to be useful for indoor modeling by volunteers, it must be user-friendly and its output must meet, or come close to, the standards set by industry.
Therefore, this paper investigates the utility of the Structure Sensor in terms of ease of use, and the quality of the 3D models reconstructed using the sensor in relation to a surveying standard, the Surveying (Cadastral Surveys) Regulations 2005 in Victoria, Australia.
The rest of the paper is organized as follows: Section 2 introduces the sensor, its engineering, and its 3D computation principle. Section 3 discusses the research methodology and expands on the data collection and processing steps. Section 4 presents the results of the research, and Section 5 complements the results with a discussion. The paper concludes in Section 6.

The Structure Sensor
The Structure Sensor developed by Occipital (2014) is an open source platform that performs as a mobile Structured Light System (SLS) when connected to a tablet, mobile phone, or computer. This SLS consists of a laser-emitting diode, an infrared radiation range projector, an infrared sensor, and the iPad's RGB sensor, which send data to a system on a chip (SOC) for processing (Figure 1).
At the time of investigation, a detailed description of the technical specifications had not been released to the public, and the sensor configuration had been investigated only through experiments by application developers. According to one such investigation (http://Reconstructme.Net/2014/07/29/Structure-Io-Sensor-In-Reconstructme/, accessed 29 September 2014), the Structure Sensor is based on the PrimeSense infrared depth-sensing technology (Table 1). The device brings certain innovations: communication with mobile devices, integration with mobile applications, battery and low-power operation, and improved mobility and versatility. In addition, the sensor can be connected to any platform through the customized USB 2.0 cable available with the Structure Software Development Kit.
The output stream from the Structure Sensor alone consists of a point data-set with a resolution of 640 × 480 pixels, where every pixel records the distance from the sensor to the target. The infrared sensor records the reflectance intensity of the infrared (IR) light pattern projected by the IR projector onto the target, while the SOC triangulates the 3D scene.
According to its inventors (Freedman 2012), the micro-lenses in the projector have different focal lengths that focus the infrared radiation to form a non-uniform pattern of dots that varies with distance from the sensor. The reference pattern is stored in the memory of the sensor to help establish correspondences between pattern speckles and to calculate range data as a disparity relative to the current image. Through image processing and triangulation algorithms, the distortion of the projected pattern is converted into 3D information. Khoshelham and Elberink (2012) discuss in detail the correlation between disparity and depth, finding that disparity values are inversely proportional to the actual depth Z of a point. Figure 2 shows the geometry of converting disparity to depth when the target point (black dot in the figure) is projected at depth Z on a plane farther away than the IR speckle pattern reference plane (green dot). The blue circle shows the location of the IR camera at a distance b = 65 mm from the IR projector, shown as the red circle. The purple circle represents the iPad's RGB camera, at a distance c = 6.5 mm from the IR camera (Occipital 2014), as allowed by the precision bracket accessory mounted on the iPad.
Depth images are formed on the imaging plane through the perspective IR camera. X_ref is the horizontal distance of the reference point from the IR projector, and X is the horizontal distance of the actual point being measured from the IR projector. Z_ref is the depth of the point on the reference pattern, and Z is the actual depth of the target point. The disparity d on the imaging plane is the displacement in pixels between the two patterns. This model assumes that the collinearity condition is met: the imaging ray is a straight line, and the target point, image point, and perspective center lie on the same imaging ray.
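The inverse relation between disparity and depth can be illustrated with a short sketch. It assumes the form given by Khoshelham and Elberink (2012), Z = Z_ref / (1 + (Z_ref / (f b)) d), where f is the focal length of the IR camera. Only the baseline b = 65 mm comes from the text; the focal length and reference depth used here are hypothetical values chosen for illustration, not the sensor's calibration.

```python
def depth_from_disparity(d_px, z_ref_m, f_px, b_m=0.065):
    """Depth Z (m) of a point from its disparity d (pixels).

    Assumed model: Z = Z_ref / (1 + (Z_ref / (f * b)) * d).
    In this sign convention a positive disparity means the point lies
    closer to the sensor than the reference plane.
    """
    return z_ref_m / (1.0 + (z_ref_m / (f_px * b_m)) * d_px)


# Hypothetical focal length (pixels) and reference plane depth (metres).
F_PX = 570.0
Z_REF_M = 2.0

for d in (-5.0, 0.0, 5.0):
    z = depth_from_disparity(d, Z_REF_M, F_PX)
    print(f"disparity {d:+.1f} px -> depth {z:.3f} m")
```

A zero disparity returns the reference depth itself, and the depth changes non-linearly as the disparity grows, which is why depth resolution degrades with distance in this model.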
The image formation process involves the projection of a set of points with coordinates (x, y, z) in 3D space (P^3) onto a 2D imaging plane (P^2), giving points with coordinates (x, y). The projection is defined by the perspective projection matrix M, which is built from C, the coordinates of the camera origin in the world coordinate frame; K, the matrix of intrinsic camera parameters; and R, the matrix of extrinsic camera parameters (rotation/translation). Within K, f_x and f_y are the focal lengths expressed in pixel units, and c_x and c_y (also written p_x and p_y) are the principal point coordinates, usually positioned at the image centre.
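As an illustration of this projection, the sketch below assumes the standard pinhole form x ~ K R (X - C), which is equivalent to M = K R [I | -C], with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]. This is the textbook model rather than a form confirmed for the Structure Sensor, and all numeric values are hypothetical.

```python
def project_point(X, K, R, C):
    """Project a 3D world point X to pixel coordinates (u, v).

    Assumed pinhole model: x ~ K R (X - C).
    K: 3x3 intrinsic matrix, R: 3x3 rotation, C: camera origin in world frame.
    """
    # Express the point relative to the camera origin, then rotate it
    # into the camera frame.
    t = [X[i] - C[i] for i in range(3)]
    cam = [sum(R[r][c] * t[c] for c in range(3)) for r in range(3)]
    # Apply the intrinsic parameters and divide by the third coordinate.
    h = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    return h[0] / h[2], h[1] / h[2]


# Hypothetical intrinsics for a 640 x 480 depth image.
fx, fy = 570.0, 570.0
cx, cy = 320.0, 240.0
K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
C = [0.0, 0.0, 0.0]                                       # camera at origin

print(project_point([0.0, 0.0, 2.0], K, R, C))  # point on the optical axis
print(project_point([0.5, 0.0, 2.0], K, R, C))  # point 0.5 m to the side
```

With these values, a point on the optical axis lands on the principal point, which is a quick sanity check for any intrinsic matrix read from a calibration file.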

Methodology
The experiment includes the following steps: acquiring the point cloud of three indoor scenes of different sizes, data processing, and, in post-processing, quality assessment of the measurements against existing models as required by the Surveying (Cadastral Surveys) Regulations 2005 in Victoria, Australia. Regulation 7(1)(c) of the Surveying (Cadastral Surveys) Regulations 2005 states that licensed surveyors must ensure all lengths are measured or determined to an accuracy of 10 millimetres + 60 parts per million (ppm). Operational issues encountered when mapping indoors with the Structure Sensor were also documented. The indoor scenes studied during this experiment are described in Table 2. When attached to a mobile device, the sensor allows a single user to scan a whole room, with the SLS and the computer base connected to a personal Wi-Fi hotspot. The applications used during the data acquisition phase were Occipital's official applications for iPad, which track the progress of the scan in real time and give immediate feedback on the scanning process.
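The tolerance stated in Regulation 7(1)(c) can be written as a simple function of the measured length. The sketch below is only an illustration of that arithmetic; the function names and the sample lengths are hypothetical, and the reference length is assumed to be the one taken from the existing model.

```python
def allowed_error_m(length_m, fixed_mm=10.0, ppm=60.0):
    """Maximum permitted error in metres: 10 mm + 60 ppm of the length."""
    return fixed_mm / 1000.0 + ppm * 1e-6 * length_m


def meets_tolerance(measured_m, reference_m):
    """True if the measured length is within the tolerance of the reference."""
    return abs(measured_m - reference_m) <= allowed_error_m(reference_m)


# Hypothetical 3 m wall: the tolerance is 10 mm + 0.18 mm = 10.18 mm.
print(f"{allowed_error_m(3.0) * 1000:.2f} mm")
print(meets_tolerance(3.005, 3.0))  # 5 mm error
print(meets_tolerance(2.979, 3.0))  # 21 mm error
```

For room-scale lengths the 60 ppm term contributes well under a millimetre, so the fixed 10 mm term dominates the requirement.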
The Structure Calibrator application for iPad calculates the precise alignment between the coordinate systems of the sensor and the iOS device camera to improve distance calculations. This initial calibration was performed on the sensor to improve the accuracy of the reconstructed model.
To understand the specifics of the Structure Sensor's raw point cloud data, the RGB data from the iPad camera was not collected for the experiment. The Structure application for iPad is a 3D reconstruction system that progressively creates a 3D polygon mesh model of a scene in real time, at a frame rate of 7-15 frames per second. The mesh was created through triangulation (Bern and Eppstein 1992), storing information about the vertices (v_x, v_y, v_z) and the normals to faces (n_x, n_y, n_z).
For data collection with the Structure Sensor, the paper adopted the following scanning process as best practice guidelines for performing indoor scans, to save time and obtain a better scanning result. The user should stand in the middle of the room, keeping circular movement to a minimum and maintaining the best scanning range for the sensor (under 2 m). The scanning cube dimension in the settings should allow maximum coverage of the wall and a minimum tilt of the sensor upward and downward to collect ceiling and floor information. The movement should follow a floor <tilt> wall <tilt> ceiling [left/right turn] ceiling <tilt> wall <tilt> floor motion pattern for better results. Scanning should also be performed under desks, for better coverage of the wall. The scan should start facing the wall with the least texture and complete a full rotation, returning to the initial pose. This strategy ensures minimum loss of texture and important features from the target scene. For bare walls, to avoid scanning issues caused by the lack of texture, a geometrically complex object could be placed in front of the scanner to improve the alignment of successive frames performed by the sensor (Dutta 2012). Scanning a wall from distances larger than 2 m results in coarse and less accurate representations of the target, with the danger of losing track of the recorded scene. Therefore, keeping a constant distance of less than 2 m from each wall during scanning is recommended to ensure a higher quality scan.
Different scanning approaches were tested for the 3D reconstruction of the studied rooms. The approaches include scanning the whole room in one run using a circular trajectory with a return to the initial position; scanning half of the room at a time; scanning one wall at a time; and the 3×3 photogrammetry approach (Waldhäusl and Ogleby 1994).
The processing phase involves removing outliers and clearing the data-set of unnecessary furniture objects. Finally, measurements were extracted from the 3D models for the accuracy assessment against the Surveying (Cadastral Surveys) Regulations 2005 in Victoria, Australia.
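The accuracy assessment compares each length extracted from the scan with the corresponding length in the existing model. A minimal sketch of such a comparison follows. The sample lengths are hypothetical, the deviation is assumed to be defined as reference minus measured (so underestimation is positive), and the RMSE here is computed directly on the differences, which is a simpler variant than the RMSE about a fitted regression line.

```python
import math
import statistics


def error_metrics(measured, reference):
    """Absolute and relative errors, MRE (%) and RMSE for paired lengths (m)."""
    # Assumed sign convention: positive deviation = scan underestimates.
    signed_dev = [r - m for m, r in zip(measured, reference)]
    abs_err = [abs(d) for d in signed_dev]
    rel_err = [a / r for a, r in zip(abs_err, reference)]
    return {
        "signed_dev": signed_dev,
        "abs_err": abs_err,
        "rel_err": rel_err,
        "mre_percent": 100.0 * statistics.mean(rel_err),
        "rmse": math.sqrt(statistics.mean(d * d for d in signed_dev)),
    }


# Hypothetical lengths in metres, not values from Appendix 2.
measured = [2.95, 3.90, 2.40]
reference = [3.00, 4.00, 2.50]

metrics = error_metrics(measured, reference)
print(f"MRE  = {metrics['mre_percent']:.3f} %")
print(f"RMSE = {metrics['rmse']:.4f} m")
```

Keeping the signed deviation alongside the absolute error makes a systematic bias, such as a tendency to underestimate, visible at a glance.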

Results
The matrix of intrinsic parameters for the depth stream was obtained from the Calibrator application, in the file calibration.yml saved with the 3D model for offline reconstruction on the computer drive. The matrix of intrinsic parameters computed for the Structure Sensor depth stream is provided in Appendix 1.
As no information was provided about the RGB, depth, or infrared camera distortion parameters, it was assumed that they were not used in the depth image formation algorithm. The average scanning duration of a room was 6 min. The data collected were 3D models of the scenes, which were exported in PLY file format. From the data acquisition workflow perspective, the closest result to the actual 3D geometry of the scanned scene was given by the model acquired with a circular trajectory, when scanning in one run, as seen in Figure 3(c). For the remaining data acquisition workflows, the alignment of raw, unprocessed frames was affected by the accumulation of systematic distortions within each of the captured depth images. The 3D model resulting from the circular scanning trajectory was selected for the point cloud processing phase, which simplifies the process by eliminating the pre-processing alignment step for separately acquired scenes.
The accuracy of the measurements extracted from the scan of each room presented in Figure 4 was assessed against measurements extracted from their existing models to see whether they meet the precision required for 3D reconstruction in building surveys.
As shown in Appendix 2, a total of 20 measurements were divided into two categories: measurements acquired from the resulting 3D models and measurements from existing models. For each reconstructed room, absolute and relative errors of the measurements were calculated to assess the sensor's reliability. The distribution of absolute errors throughout the measurements for each room (Figure 5) confirms the sensor's tendency to underestimate measurements, with positive deviations from the existing models' dimensions.
In Figure 5, each horizontal axis represents a case study, while the vertical axis shows the absolute error values of its measurements.
The 3D model for the smallest room, Room 3, has more accurate room dimension estimations, with a minimum absolute error of 0.021 m and a maximum of 0.59 m, while the absolute errors of the Room 1 model range from 0.157 to 2.012 m. The size of the interior subject to reconstruction is likely to affect the accuracy of the results, with lower deviations from true values for smaller rooms.
The relative errors of the dimensions extracted from the 3D scan model have a standard deviation of 0.091. This high value indicates that the relative errors vary over a large range of values, with half of the relative errors under the median value of 0.093. The mean relative error (MRE) was computed as 12.217%, representing the relative size of the errors within the actual measurements. In Figure 6, the R-squared value shows a good fit of the data to observed measurements, with good correlation between estimated measurements and existing data. Still, the root-mean-square error (RMSE), a measure of the absolute fit of the data, shows an average distance of 0.742 m from the data points to the fitted line.

Discussion

Data collection
The accuracy of a 3D reconstructed model of indoor environments using depth information is limited by the detail of the input data, usually imposed by the user. For example, occlusions in the point clouds were mostly generated by the position of objects (furniture) in front of the walls or by surface materials. Therefore, scene preparation before data acquisition is necessary (Dutta 2012) to avoid the loss of data. Scene preparation is part of the normal routine in indoor mapping by experts. For non-expert volunteers, this should not be a problem if the objects causing occlusion can be removed. If the objects are fixed, however, gaps in the model are inevitable, which is a barrier for non-experts to generate complete models.
During the data collection phase, the central issue was the loss of sensor position. The loss was caused by fast motion of the handheld scanner or by the target's lack of texture to support the automatic alignment of consecutive frames. When the camera tracking was lost, the scanning process had to be restarted. This issue was resolved after a few trials, through which we established the optimum scanning speed. Both experts and non-expert volunteers can find the best movement speed after a few tries.
The incremental registration algorithm for successive frames allows a progressive accumulation of distortions along the scan path, which results in incorrect estimates of the sensor's actual location and of its angular movements. As a result, the scan skipped a portion of the wall and closed the loop, causing loss of data because of incorrect sensor position estimation (see Figure 4(a)). This issue is attributed to the sensor and its registration model and potentially limits the ability of the non-expert. Alternative registration methods could be used; however, this challenge is beyond the capacity of non-experts to overcome.
Also, when scanning at distances larger than 2 m, regularly spaced vertical bands were artificially constructed by the sensor (Figure 7(b)), especially affecting the scan of less-structured surfaces. This artifact has been observed with other close-range sensors such as Microsoft's Kinect (Escardo-Raffo 2011). Similar to the previous issues, this problem is related to the sensor's technology and presents a barrier for volunteers.
A similar issue arises for windows, where the IR pattern penetrated the glass surfaces, resulting in outlier points beyond the bounding cuboid of the room (Figure 7(b)). Occlusions in the scene also cause the rays to cast depth shadows (Figure 7(a)), which can cover large areas of the scene, depending on the distance and viewpoint angle to the target. This issue is common in laser-ranging devices. Covering the windows and minimizing sunlight improves the scanning result in the areas next to windows and reduces the loss of data, and volunteers can take both measures.

Quality of outputs
Based on the analysis of the error metrics in the previous section, the Structure Sensor delivers better 3D reconstruction models for smaller rooms, with smaller deviations from true dimensions. Our case study covered two rooms of a residential building and one university computer lab. One can argue that the device is suitable for modeling apartment rooms by either experts or non-expert volunteers.
According to Regulation 7(1) of the Surveying (Cadastral Surveys) Regulations 2005 in Victoria, licensed surveyors must ensure all lengths are measured or determined to an accuracy of 10 mm + 60 parts per million (ppm). This is the level of accuracy required to ensure quality data for managing and securing property ownership rights. Other application domains such as indoor navigation and public safety do not necessarily require such a level of accuracy.
For the Structure Sensor, the results do not quite meet, but are close to, the requirements for cadastral surveys in Victoria (Surveying (Cadastral Surveys) Regulations 2005), with deviations higher than the maximum admitted of 0.01 m. Although the outputs do not meet the surveying standards, such a level of accuracy is enough for the aforementioned application domains. This means that even volunteers and non-experts can produce indoor models that satisfy the needs of a wide range of applications.
Also, the inaccurate reconstruction of the glued scenes in Figure 3(a), (b), and (d) could be a result of the discretization error described by Basu and Sahabi (2002) for stereo vision systems, and of the limitations caused by the sensor's narrow field of view. However, all the scanned models of Room 1 displayed in Figure 3 tend to close toward the center of the room, which is a reason to believe that the depth sensor might report underestimated distances. This issue needs to be further investigated.
Random errors could have been introduced by the user through deviations from the circular trajectory to avoid obstacles in the room. Other factors, such as the time allowed for acquiring the data-set and the user's attention to detail, also affect the reconstruction result.
The errors in the 3D reconstruction using the Structure Sensor could be a consequence of not calibrating the IR sensor prior to the experiment and relying only on the calibration performed by the iOS application, as mentioned in Section 3. Other systematic errors were probably introduced by the Structure iOS application algorithm during the registration of the frames in real time without RGB information, as well as by the algorithm used during data acquisition.
The limitations of the sensor, when used for indoor surveying, could be excused by its initial purpose: the sensor was designed for 3D scanning of objects. However, the quality of the outputs is still sufficient for use in other application domains. By implementing a different set of actions, following a different scanning trajectory, and choosing a smaller range from scanned objects, both experts and non-experts can create 3D indoor models.

Conclusions and further work
This paper investigated the use of the recently released mobile depth sensor, Structure, and the accuracy of measurements from the sensor data, especially point clouds, in the context of volunteered 3D indoor information. It concluded that the Structure Sensor achieved a higher accuracy for smaller rooms, with an area of approximately 3 m × 3 m, at a scanning range of 2 m. The results meet the requirements of application domains such as indoor navigation and indoor positioning, but not the requirements for cadastral surveys in Victoria (Surveying (Cadastral Surveys) Regulations 2005). The investigation has determined that further analysis is needed to assess the distortion errors of the depth camera through calibration and to test whether the tendency to underestimate distances persists after calibration. This tendency needs to be quantified and applied as a correction in the 3D reconstruction process to improve the scanning results. At the same time, better scanning algorithms are necessary for a more accurate reconstruction of the geometry of large indoor scenes. The project could be extended to obtain RGB data in addition to the depth data to advance the current investigation.