Challenges in surgical video annotation

Abstract Annotation of surgical video is important for establishing ground truth in surgical data science endeavors that involve computer vision. With the growth of the field over the last decade, several challenges have been identified in annotating spatial, temporal, and clinical elements of surgical video as well as challenges in selecting annotators. In reviewing current challenges, we provide suggestions on opportunities for improvement and possible next steps to enable translation of surgical data science efforts in surgical video analysis to clinical research and practice.


Introduction
Annotation of visual data is important as it provides ground truth labels of real-world objects, scenes, and events that can then be utilized to train computer vision algorithms. While everyday tasks such as face recognition and some types of object recognition in classification can be performed with a high degree of accuracy, there remains quite a gap when comparing state-of-the-art computer vision performance to that of humans. This gap is particularly pronounced in surgery.
Indeed, several elements about surgery present challenges when considering methods for annotation of surgical data, particularly surgical video. Understand that operative events can be defined temporally, over the amount of time in which they occur; visually, as discrete spatial elements; or a combination of both. Algorithms should be designed to impart clinically significant outputs, which, therefore, requires clinically significant temporal and spatial annotations [1]. However, what constitutes a clinically significant event that requires annotation? In the case of bleeding, is bleeding an event that happens in discrete time periods that can be easily annotated? Which episodes of bleeding are worth annotation as clinically notable, and which can be considered 'normal' surgical oozing that does not require annotation? Such questions highlight the importance of considering and addressing challenging annotation questions upfront in a given project.
This paper reviews some of the challenges that are currently being tackled in annotation of surgical video and offers a clinical perspective on considerations to be made by researchers when defining their annotation schemas. We review challenges that arise in the selection of annotators, in spatial annotation, phase annotation (annotation of operative steps), annotation of clinically meaningful events, and annotations of surgical performance (Table 1).

Challenges with annotators
When considering the domain experience of annotators, one must account not only for an annotator's experience in annotating video but also for their experience in surgery. When selecting annotators, balancing expertise in these two domains is a challenge that will require further research to determine which attributes of annotators result in the generation of clinically relevant and consistent annotations. Clinical experience can be an initial discriminator when classifying different annotators, resulting in 'Clinical Expert', 'Clinical Trainee', 'Layperson', or 'Crowd' (Table 2). While only a handful of papers in computer vision have been published to date regarding the differences in annotations between clinical experts (or clinical trainees) and crowd annotators, studies in the field of surgical education have investigated differences between such annotators in identifying surgical anatomy or rating surgical performance [2]. Prior work has demonstrated the feasibility of utilizing crowd annotators to annotate elements of video ranging from anatomic structures to performance. For simple tasks with well-defined criteria, such as identifying surgical instruments, layperson and crowd annotators can annotate at a level similar to that of surgeons [3][4][5][6][7]. We caution that in many of these studies, videos were pre-edited by clinical experts to only show brief video clips of the procedure's critical portions. This pre-editing makes annotations by lay annotators easier to perform as it constrains the data to be more clinically relevant and serves to filter some of the noise in the data.
Furthermore, crowd annotators may exploit class imbalances in the data to maximize their percentage of correct annotations, preferentially selecting labels that are more likely to be prevalent in the data. For example, crowd annotators may liberally annotate the presence of a 'grasper' in a video because graspers are such common instruments. In an additional illustration of this point, Deal et al. demonstrated that while crowd annotators and clinical expert annotators had a good degree of correlation in their assessment of the quality of the critical view of safety (CVS) attained in laparoscopic cholecystectomy, crowd annotators were less likely to recognize high CVS scores and more likely to give an 'average' score (3 or 4 out of 6) than clinical expert annotators. Furthermore, crowd annotators were less likely to be able to identify poor quality CVS when compared to clinical expert annotators and again favored giving an average score [4,8]. The 'average' score is less likely to be perceived as incorrect or otherwise flagged in statistical analysis for outliers in annotations. Thus, while simple tasks may be handled reasonably by crowd annotators, for more complex tasks such as identifying anatomy and the quality of a dissection, crowd annotators' results can vary from those of clinical expert and trainee annotators. Careful selection of annotators is, therefore, necessary. Experienced surgeons are costly, both from a financial perspective if reimbursing them for their time spent on annotations and from an opportunity cost perspective, where time spent annotating video is time away from treating patients. Thus, while it may be sufficient to have the crowd annotate low-level items of interest such as tools, it is likely necessary to have clinical experts or trainees annotate more complex phenomena.
Clinical expert and trainee annotators have surgical experience and possess a broader understanding of surgical principles that allows them to interpret segments of video that may not cleanly fit annotation definitions. Additionally, more experienced clinical annotators can provide insight into new labels that may need to be created. For example, an annotator with limited surgical experience may either incorrectly label (e.g. believing a bladder neck reconstruction is part of an urethrovesical anastomosis) or be unable to label a segment. A clinically experienced annotator would instead see the related set of actions as being distinct and requiring a different, novel label.
One possible solution to the cost is a 'two-pass' method of video annotation [2]. On the first pass, a layperson annotates the video to the best of their ability and highlights areas in which they have uncertainty. This first annotation set could be generated by a crowd annotator, or even by an automated machine learning model trained on a small amount of data. As proof of concept, some automated models can identify surgical phases with datasets of under 100 videos; however, these works do not all explicitly define the qualifications of the annotators they utilized [9]. On the second annotation 'pass,' the clinical expert annotator could rapidly annotate the video by only reviewing areas of uncertainty, the boundaries of start and end of phases, and areas requiring expert knowledge (e.g. the steps of an intracorporeal anastomosis). To ensure quality of the 'first pass,' clinical expert annotators could be used to review and verify (i.e. audit) a subset of the first-pass annotations. Additional research is needed to determine how much of the data would need to be audited to ensure data quality. Such audits also raise the issue of how best to assess the agreement or inter-rater reliability between annotators.
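The routing logic of the two-pass approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` fields, the `plan_second_pass` helper, and the default audit fraction of 20% are all hypothetical, and the text above notes that the appropriate audit size is still an open research question.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float    # segment start, in seconds from the start of the video
    end_s: float      # segment end, in seconds
    label: str        # label assigned on the first pass
    uncertain: bool   # flagged as uncertain by the first-pass annotator or model


def plan_second_pass(segments, audit_fraction=0.2, seed=0):
    """Split first-pass segments into an expert review queue and an audit sample.

    audit_fraction is an assumed value, not a recommendation from the literature.
    """
    # Everything the first pass flagged goes straight to the clinical expert.
    review = [s for s in segments if s.uncertain]
    confident = [s for s in segments if not s.uncertain]
    # A random subset of the remaining segments is audited for quality control.
    n_audit = min(len(confident), math.ceil(audit_fraction * len(confident)))
    audit = random.Random(seed).sample(confident, n_audit)
    return review, audit
```

In this sketch the expert sees only the flagged segments plus a reproducible random audit sample; a fuller version would also queue phase boundaries for review, as described above.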
While early work in the field utilized a single clinical expert annotator to ensure consistency of annotations across all data [10], subsequent research has since incorporated multiple clinical expert annotators, both to spread the burden of annotation and to allow for measurement of the potential reliability of annotations [11]. Assessing differences between annotators allows one to determine whether the definitions of the phenomena of interest were appropriately applied or understood. Depending on the type of annotation under investigation, different metrics can be calculated to assess inter-annotator reliability. At perhaps the most basic level, a simple percent agreement can be calculated. However, this does not account for potential agreement that can occur by chance alone. Therefore, various statistical measures of agreement can be considered, such as Cohen's or Fleiss's kappa, Krippendorff's alpha, or the intraclass correlation coefficient [11,12]. An in-depth discussion of the appropriateness of individual metrics for a given situation is outside the scope of this article; however, it is important to consider aspects such as the number of annotators and the prevalence of a given annotation [13,14]. Reassuringly, in a preliminary study, pooling annotations from multiple clinical expert annotators did not result in a decrease of the trained model's performance [15].
To help reduce variation across annotators, it is critical to precisely define the phenomenon of interest that is to be annotated. The critical challenge in developing annotation guidelines is that they require the annotator to know, from the video alone, the surgeon's intent. Therefore, surrogates of surgical intent from video cues alone must be identified for accurate video annotation. These surrogates, often referred to as anchors or definitions, are used by annotators to determine how to classify a procedure's video segments (e.g. from time t1 to t2 the surgeon completes a gastrojejunal anastomosis) or spatial elements (e.g. the pixel at (x1, y1, z1) denotes part of structure A). Defining these anchors often incurs a tradeoff between inter-annotator reproducibility (and, therefore, an algorithm's performance) and capturing clinically meaningful phenomena. Consider labeling the process of creating a gastrojejunal anastomosis. This step 'starts' when the surgeon decides to begin the anastomosis, for which there are no visible video cues. An annotator would have to guess or otherwise infer when the surgeon makes this decision, creating significant variability. To create a reproducible annotation, the step's start could be defined as the moment when an instrument that creates the enterotomies first touches the tissue. However, while reproducible, this fails to capture intent and results in a definition most surgeons would say is too narrow and loses important information (e.g. orienting and selecting the ideal loop of bowel). Finally, the wide variety of surgical techniques can result in visually distinctive segments between different surgeons and institutions that may need to be classified separately.
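Returning to the agreement measures discussed above, the difference between percent agreement and a chance-corrected statistic can be shown with a small worked example. The sketch below implements Cohen's kappa for two annotators from its standard definition, kappa = (p_o - p_e) / (1 - p_e); the label lists are invented for illustration and do not come from any cited study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators assigned the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)                 # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label frequencies.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        return float("nan")                       # undefined: both used one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: the expert sees scissors in 2 of 10 clips,
# while a crowd annotator labels every clip 'grasper'.
expert = ["grasper"] * 8 + ["scissors"] * 2
crowd = ["grasper"] * 10

print(percent_agreement(expert, crowd))  # 0.8
print(cohens_kappa(expert, crowd))       # 0.0
```

The always-'grasper' annotator reaches 80% raw agreement yet a kappa of zero, which is one way the class-imbalance behavior described earlier can be surfaced during an audit.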
As detailed in the subsequent challenges on spatial and temporal annotation, the balance between having flexible definitions that preserve clinical relevance and precise definitions that improve inter-rater reliability can be optimized by considering the specific phenomena of interest. Some variability in annotating clinical phenomena may be unavoidable as such variability may reflect inherent differences in the conceptualization of such phenomena by surgeons. For example, surgeons may differ in their interpretation of the correct surgical plane (i.e. the potential space between two structures through which a dissection can be performed) or in the amount of bleeding that qualifies as clinically significant. These underlying differences could provide clues on the difficulty of a surgical situation (e.g. significant adhesions or inflammation) and may require additional annotation from human experts. While high variability between annotators in such edge cases might threaten a project seeking to utilize automated methods, it can also serve as a useful metric to more closely study a clinical phenomenon through other methods that may be more appropriate (e.g. qualitative methods).

Challenges in spatial annotation
Spatial annotation refers to the annotation of the spatial information (e.g. position, region of interest) of specific elements such as anatomy, tools, or visually salient events (e.g. blood) without necessarily including a consideration of the temporal manner in which such elements may arise. At first glance, annotation of structures along a spatial coordinate system would seem to be straightforward; however, several considerations arise when evaluating annotations created with minimal guidance.
As with any spatial annotation task, the phenomenon of interest for a research task should be well-defined a priori. The importance of determining first the phenomenon of interest as opposed to the phase or workflow of interest is exemplified by the task of identifying the critical view of safety (CVS) in laparoscopic cholecystectomy. The CVS is defined by the Society of American Gastrointestinal and Endoscopic Surgeons as a method to identify the cystic duct and artery during laparoscopic cholecystectomy. More specifically, the view that must be obtained is defined by achieving the following three criteria: (1) the hepatocystic triangle is cleared of fat and fibrous tissue, (2) the lower one third of the gallbladder is separated from the liver to expose the cystic plate, and (3) two and only two structures should be seen entering the gallbladder. Researchers interested in annotating the CVS must consider how these criteria will be applied based on the phenomenon or question of interest.
For classification tasks, it may be sufficient to simply collect a dataset containing images of CVS. One must then consider what quality of CVS has been attained as not all critical views are created equal and a grading system has been proposed to identify different qualities of CVS (Figure 1). Examples of high-quality CVS may rarely be found in existing datasets. Furthermore, many surgeons rarely strive to pursue the highest quality CVS, instead choosing to obtain a CVS sufficient for their level of comfort in identifying the key structures. Thus, in this example, one should determine a priori whether the goal is to classify high-quality CVS only or a range of CVS quality.
To further extend the example of identifying CVS, consideration should be made for the granularity of spatial annotation necessary. Bounding boxes may be sufficient to identify large aspects of anatomy such as whole organs (e.g. the gallbladder) or tools (e.g. Maryland grasper), but in the case of CVS, semantic segmentation may be more appropriate where the use of bounding boxes may lead to incorrect or overlapping annotation of structures (Figure 2). Further, for granular annotations of anatomy, some structures may have clearly demarcated 'starts' and 'ends' (i.e. the edge of the liver) whereas other structures may be less discrete. For example, when annotating the cystic artery, the connective tissue surrounding the artery may make labeling the structure difficult as the border between the artery and the gallbladder may be 'fuzzy'. Approaches borrowed from surgical education, such as visual concordance testing, may help to better delineate these 'fuzzy' borders to arrive at a consensus annotation [16]. Clinical expert annotators may be able to better evaluate this border but there will likely be bias in how an image is labeled, particularly in datasets where videos are labeled by a small group of annotators.
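The overlap problem with bounding boxes can be quantified with a simple intersection-over-union (IoU) calculation. The sketch below is illustrative only: the box format (x_min, y_min, x_max, y_max) and the pixel coordinates for two adjacent structures are assumptions, not values taken from any dataset.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def intersection_area(a, b):
    """Area shared by two axis-aligned boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0


# Hypothetical boxes around two thin, obliquely oriented structures.
duct_box = (0, 0, 10, 10)
artery_box = (5, 5, 15, 15)
print(intersection_area(duct_box, artery_box))  # 25 pixels claimed by both boxes
```

Pixels inside the shared region carry two labels at once, which is the ambiguity that a per-pixel segmentation mask avoids.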
Spatial annotation in video can be tedious as objects may have to be tracked over large periods of time with the average video consisting of 25-30 frames per second (fps). Certainly, there may be no need to sample frames in real-time, and some phenomena can be sampled at only 1-2 fps. This again calls for consideration of the specific clinical phenomena for which the annotations are being generated. Software tools that assist with automated tracking of objects can be used but may also require auditing and correction.
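The reduction in annotation workload from temporal subsampling is easy to estimate. The following sketch, with a hypothetical helper name and arguments, lists which frame indices would be annotated when a video recorded at its native frame rate is sampled at a lower target rate.

```python
def sampled_frame_indices(n_frames, native_fps, target_fps):
    """Frame indices to annotate when subsampling to roughly target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never sample more densely than the video itself provides.
    step = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < n_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# One minute of video at 25 fps, annotated at 1 fps.
one_minute = sampled_frame_indices(n_frames=1500, native_fps=25, target_fps=1)
print(len(one_minute))  # 60 frames to annotate instead of 1500
```

Whether 1 to 2 fps is adequate still depends on the clinical phenomenon: a slowly changing view may tolerate it, whereas a brief event could fall entirely between sampled frames.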
One should also consider how to annotate some of the more abstract spatial characteristics that are perceived by surgeons, including the concepts of surgical planes (the potential, avascular interface or space that exists between structures or different types of tissue), retraction, and exposure. While these are largely spatial concepts, each of these can change in slight but important ways with time. As such, the challenges in annotating these characteristics will be described separately below.

Challenges in temporal annotation
An area of particular interest within the surgical community is understanding surgical workflow. Initial forays into workflow analysis (also known as surgical process modeling) were established by the work of Pierre Jannin and his group and have subsequently been extended with the goal of pursuing a common ontology [17]. As with spatial annotation, temporal annotation brings with it many challenges to carefully consider prior to implementing and sinking time into annotating a large number of videos.
The importance of determining first the phenomenon of interest as opposed to the phase or workflow of interest is again highlighted and drives how temporal annotations may be defined. One needs to consider if the consumer of the annotations is acausal or causal. An acausal consumer has access to the entire video and can use past and future video frames to help identify the phenomenon of interest in the current video frame. A causal consumer, unlike an acausal one, knows only the past and current frames of the video, so it cannot use future video events to help with decision making. An example of an acausal consumer is an algorithm that automatically labels the steps of a video found in a surgical video library. The algorithm can use information from the entire video to precisely label the start and end times of phases. For example, it often is hard to identify when a phase is finished since there are instrument exchanges in the field and no clear visual cues to a phase transition. Knowing exactly when the next phase starts in the future (since it has access to the future video frames) allows the acausal labeling model to accurately determine the phase's end. A causal consumer, on the other hand, might be an algorithm used in the operating room in real time to help surgeons with their decision processes. Just like the surgeon, this causal AI model will not know the future events, and therefore needs to 'think like a surgeon' using only information from the past video frames.
The intended use of the annotations, be it for a causal or acausal consumer, will heavily impact the definition of annotations and the process for generating them. For example, if one defines 'Dissection of Calot's Triangle' as the appearance of a dissecting tool on screen during active dissection, during an operation the surgeon may change from dissection to using the tool to remove an adhesion. To label this phase transition, the annotator, knowing only the video frame, will then need to rewind, end their annotation of 'Dissection of Calot's triangle,' and then relabel the next portion as 'Remove adhesion.' It is important to realize that by rewinding and modifying their previous annotation, they are now performing an acausal annotation. If this annotation is used to train a causal real-time algorithm, sub-optimal identification around the boundary of steps can occur, since the causal algorithm will not be able to account for future events. If the algorithm is tweaked and made acausal, better performance may be seen [18]. Both causal and acausal annotations are acceptable depending on the ultimate goal of their application. This phenomenon leads to a rule of thumb to achieve maximal algorithm performance: acausal annotations should only be used for acausal applications, while causal annotations can be used for both causal and acausal algorithms.
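A toy example may help make the causal and acausal distinction concrete. The sketch below smooths a sequence of per-frame phase labels by majority vote, once using only past and current frames (causal) and once using a window that also looks ahead (acausal). Majority-vote smoothing is an illustrative stand-in chosen here, not a method described in the cited works, and the labels 'A' and 'B' are placeholders for two consecutive phases.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a window (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def smooth_causal(frames, k):
    """Label frame t using only frames t-k .. t (no access to the future)."""
    return [majority(frames[max(0, t - k): t + 1]) for t in range(len(frames))]


def smooth_acausal(frames, k):
    """Label frame t using frames t-k .. t+k (past and future)."""
    return [majority(frames[max(0, t - k): t + k + 1]) for t in range(len(frames))]


frames = ["A", "A", "A", "B", "B", "B"]  # true transition at index 3
print(smooth_causal(frames, k=2))   # ['A', 'A', 'A', 'A', 'B', 'B']
print(smooth_acausal(frames, k=2))  # ['A', 'A', 'A', 'B', 'B', 'B']
```

The causal version reports the transition one frame late because it cannot see what follows, while the acausal version places the boundary exactly; this mirrors the boundary effects described above.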
Temporal annotations, like all annotations, are difficult to define in a manner that leads to consistent annotations from multiple annotators. Extremely precise start and stop times (e.g. only when the instrument is touching the tissue) can make causal annotations more reproducible between annotators. However, these styles of annotations are not only tedious to perform but may also be of limited clinical utility. Another possible solution is to consider whether some overlap in phases is acceptable or whether some phases can be combined. Consider the case of isolating the cystic duct and cystic artery in laparoscopic cholecystectomy. Dissecting the fatty, fibrous tissue between the duct and artery may help to isolate both structures. In this case, phases could be combined into 'isolation of cystic duct and artery' or could be further divided into 'isolation of cystic duct,' 'isolation of cystic artery,' and 'isolation of cystic duct and artery.' Datasets such as Cholec80 take the approach of having more general phases to work around such difficulties [10]; however, this may limit their application to more precise clinical challenges such as decision support. Once again, clearly defining the phenomenon of interest is important to determine the level of annotation that is required.
Anchoring phases around the presence of instruments can provide concrete cues to annotators about the start/end of a phase. However, the presence of a surgical instrument alone does not define an operative phase in the mind of a surgeon. Rather, the tool is selected to achieve the goals of the phase. Thus, there may be situations such as in cholecystectomy when a scissor is introduced not to cut the cystic duct or artery but to open more of the peritoneum overlying the gallbladder. Consider phases to be less about which instruments are in the video and more about how instruments interact with tissue in the operative field to yield a given phenomenon (e.g. dissection, exposure, resection, etc.) [19]. Such consideration should allow for more clinically applicable annotations.
Additionally, temporal structure in an operation can be considered hierarchical. That is, a phase may consist of different steps which are performed by engaging in various actions. Prior work has described atomic surgical gestures, also known as surgemes, in terms of kinematics of robotic procedures [20]. Such gestures can fit into a temporal hierarchy of workflow as gestures can combine to yield an action, which is performed as part of an operative step. There is ongoing work at the Society of American Gastrointestinal and Endoscopic Surgeons to create clinically grounded definitions for a hierarchical structure of temporal events in an operative video.
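One way to represent such a hierarchy in an annotation schema is as nested time intervals, where each child must fall within its parent. The sketch below is a minimal illustration; the class name, field names, level names, and timings are hypothetical and are not drawn from any published ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interval:
    level: str      # e.g. 'phase', 'step', 'action', 'gesture'
    name: str
    start_s: float  # seconds from the start of the video
    end_s: float
    children: List["Interval"] = field(default_factory=list)

    def add(self, child: "Interval") -> "Interval":
        """Attach a child interval, requiring it to lie within this one."""
        if not (self.start_s <= child.start_s and child.end_s <= self.end_s):
            raise ValueError(f"{child.name!r} is not contained in {self.name!r}")
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of hierarchy levels at or below this interval."""
        return 1 + max((c.depth() for c in self.children), default=0)


# Hypothetical annotation of part of a laparoscopic cholecystectomy.
phase = Interval("phase", "isolation of cystic duct and artery", 100.0, 400.0)
step = phase.add(Interval("step", "isolation of cystic duct", 100.0, 250.0))
action = step.add(Interval("action", "dissection", 120.0, 180.0))
print(phase.depth())  # 3
```

The containment check is a deliberately strict design choice; a schema that permits overlapping phases, as discussed earlier, would need to relax it.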

Challenges in annotating clinically meaningful events and characteristics
Annotation of clinically meaningful events is one of the foremost challenges in surgical video annotation given that there is limited consensus on what constitutes clinically meaningful. Consider the case of bleeding as an example. Bleeding occurs when blood moves from the lumen of a blood vessel into the surgical field; however, clinical context is extremely important in judging if this movement of blood is potentially deleterious to the patient or an expected ooze of little consequence to clinical outcome, especially if bleeding is to be used as an event to help guide or assess surgeons. For example, while performing an anastomosis, there may be bleeding from the edge of an enterotomy: this can be an expected or even positive sign, as it indicates the tissue has adequate perfusion. However, if bleeding occurs a few millimeters away, say from tearing of the bowel or the mesentery by a grasper, this could be considered an adverse event. Other questions include: how much blood loss is potentially harmful to the patient? What rate of blood loss can be temporarily ignored or otherwise expected to be self-limited? This necessitates understanding of context as, for example, the amount of expected bleeding in an appendectomy is significantly different than a liver resection. Additionally, even in the context of a single procedure type, patient factors such as inflammation can significantly affect the judgment of which episodes of bleeding are considered expected versus unexpected.
Jung et al. demonstrated that in procedures where there is limited bleeding to be expected, consensus on classifying bleeding events and their severity can be achieved with highly trained raters [21]. However, this does not address the issue of scalability in annotating events that require a significant amount of clinical knowledge from the annotator or the consideration that this could be an extremely tedious annotation to perform. Leveraging active learning and semi-supervised approaches can reduce the burden of labeling events, anatomy, tools, and phases [22][23][24]; however, training a model to recognize clinically relevant events can be problematic. How does a model discriminate between pooled blood and a slow ooze? Does every bleeding event necessitate review by a trained evaluator? Such considerations must be clarified ahead of annotation to ensure inter-rater reliability and consistent detection of ground truth phenomena.
The annotation of intraoperative adverse events (iAEs) other than bleeding presents similar challenges. Most of the work around identifying iAEs has been completed using data from large claims databases or prospective clinical registries [25,26]. Intraoperative adverse events that may be detected in operations include clear examples such as enterotomy, inadvertent thermal injury, or other unintentional damage to an organ (e.g. ureter, spleen, liver, bile duct, etc.) and more nuanced examples such as serosal injury to bowel (e.g. from excessive traction) during adhesiolysis or inadvertent spillage of bile from the gallbladder during cholecystectomy. Many surgeons may not consider spillage of bile from the gallbladder itself to be an adverse event while others cite a possible increased risk of postoperative fluid collection as reason to consider spillage to be adverse [27]. Similarly, the perception of operative planes, adequacy of retraction and exposure, and the characterization of tissues can vary across surgeons [28]. The identifiability of clinically meaningful phenomena also presents a tremendous challenge for their annotation and successful deployment in an AI model. Phenomena are identifiable if, based on the data, labels, and AI model, they can be reproducibly identified. The patient's favorite color, for example, is a non-identifiable phenomenon in the context of surgical video analysis. Many surgical events and areas of interest, unfortunately, are either non-identifiable or poorly identifiable due to their limited visual recognizability. For example, it is often difficult to recognize, even for advanced trainees, the subtle difference between being in the correct versus incorrect surgical plane. We see this daily in the operating room manifested as the 'sixth sense' of the expert surgeon. Thus, we reiterate the importance of defining the clinical phenomena of interest that are visually identifiable in advance and in a manner that is clear to annotators.
With these types of annotations, having additional annotator training beyond just providing annotation guides and definitions can allow for iterative improvement in annotation quality across a group of annotators [2].

Challenges in annotating surgical performance
Annotation of surgical video for surgical performance has been a longstanding practice in surgical education. The goal of these annotations, historically, has not been to train an algorithm to perform automated assessment but rather, to provide structured, formative feedback to surgeons. Methods of assessment can be divided into global and procedure/task-specific assessments, with several different assessment tools that have been validated for the purposes of distinguishing experienced and inexperienced surgeons. Kinematic approaches such as assessment of motion have also been used to differentiate experienced from novice surgeons and to track the learning curve [29,30].
The Objective Structured Assessment of Technical Skills (OSATS) is perhaps the most reported assessment tool used in surgery. It provides a global rating scale for assessment of surgical skill, including measures such as 'respect for tissue,' 'instrument handling', and 'flow of operation.' Each of these elements is rated on a 5-point Likert scale with anchors at 1 (poor performance), 3 (acceptable performance), and 5 (superior performance). As demonstrated in Table 3, while anchors are provided to help annotators better understand the performance sought, scoring is ultimately subjective and can be influenced by the surgical experience and expectations of the annotators. Similar issues exist for more domain specific assessment tools in laparoscopic and robotic surgery. Thus, while research is underway to develop summative assessment tools for specific operations [31,32], assessment of performance that is based on rating scales will continue to have elements of subjectivity. As long as it is appropriately considered in analysis, such subjectivity could actually enrich assessment if it allows for feedback to be provided to the surgeon.
Full, unedited video serves as the raw data source for assessing surgical performance and allows for assessment of the entire procedure, including both technical and non-technical aspects of a surgeon's performance. However, annotation and analysis of entire cases can be time-consuming, with operative times ranging from 20 min for short procedures to several hours for complex cases. The use of short segments of cases, edited together as a synopsis, has been explored in the surgical education literature, but reports have noted that assessments of edited videos have poor inter-rater reliability and low discriminative ability in distinguishing trained versus untrained participants [33]. Thus, while short segments may be sufficient when assessing specific tasks such as intracorporeal suturing, full videos are likely necessary to appropriately annotate performance on a procedure as a whole.
Ultimately, assessment of performance will likely be tied to clinical outcomes. The correlation between ratings of surgical performance and clinical outcomes has been well documented [34,35], raising the possibility of eventually annotating performance based on expected clinical outcome for patients rather than the subjective rating of human (or machine) graders. However, limitations in reporting of outcomes relative to surgical performance have thus far affected its application. The heterogeneity in assessing rate of learning of skills across surgeons and specialties has made it difficult to specifically identify learning curves for many procedures [36], and lack of clarity around where a surgeon may sit on the learning curve may affect expected outcomes. Variation in performance by a single, experienced surgeon across cases may lead to differences in outcome, as can differences in the complexity of a patient's presentation [37]. Finally, it is difficult to attribute causality in outcome of a patient to the intraoperative phase of care and surgeon's performance alone. Factors such as the patient's comorbidities, postoperative care, and the effect of other providers play a role in the clinical outcome of the patient. Therefore, approaches that isolate a surgeon's contributions only, whether through rating scales, kinematics, or computer vision, may only provide a partial contribution to a patient's expected clinical outcome. Given these challenges, annotation of surgical performance relative to clinical outcomes alone remains an elusive ideal that requires significant further investigation.

Challenges in annotation tools
Armed with a video dataset and well-planned annotation schema, the surgical annotator must then put  [38]. Rudimentary annotation can even be performed with spreadsheets, simply by entering data and timestamps into data cells.
A user-friendly and efficient annotation tool can make or break the annotation process. Annotation software should always be evaluated before committing to its use for a project. Features to consider include a program's interface, process for loading videos, annotation export formats, and ability to integrate AI models to facilitate annotation. Its interface must be user friendly to the annotator (be it layperson, clinical trainee, or clinical expert) and run across different operating systems. In order to annotate, it must have access to videos and be able to play a wide range of video formats. Some software can even load videos from a centralized video repository rather than requiring annotators to have video copies present on their computers. This centralized storage keeps videos in one secure location, which minimizes chain-of-custody issues with regard to privacy laws. The software must also be able to export the data in a format that the AI model can load directly or, failing that, in a format with bindings in common programming languages so that it can be easily converted for model input. Additionally, enabling version control of annotations and the data dictionary is important, as both will need continuous updating. Often, determinations are made to change the way a dataset is being annotated in order to improve algorithm development. For example, annotations may be too generic for current machine learning technology to learn from them, or annotations may not be well suited to the problem being solved. Ensuring efficient and accurate updates to labels will enable more accurate data and reduce the need to re-annotate entire datasets. Lastly, if the annotation task is to be performed on a large scale, AI models, as mentioned previously, can 'pre-annotate' the dataset. Some software facilitates this pre-annotation task, allowing it to happen directly within the annotation tool, which allows for substantially faster creation of annotated datasets.
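As an illustration of the export and versioning considerations above, the following is a minimal sketch of how temporal annotations might be stored and exported, not a description of any particular annotation tool. The record fields (`video_id`, `annotator`, `label`, `start_s`, `end_s`), the `SCHEMA_VERSION` tag, and the function names are all hypothetical and chosen only for this example. JSON and CSV are used because both have bindings in most common programming languages.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict

# Hypothetical version tag tying each export to a revision of the data dictionary.
SCHEMA_VERSION = "1.0"

@dataclass
class TemporalAnnotation:
    """One labeled time span in a video (all field names are illustrative)."""
    video_id: str
    annotator: str
    label: str
    start_s: float  # start time in seconds from the beginning of the video
    end_s: float    # end time in seconds from the beginning of the video

    def __post_init__(self):
        # Reject spans that end before they start.
        if self.end_s < self.start_s:
            raise ValueError("end_s must not precede start_s")

def to_json(annotations):
    """Serialize annotations together with the schema version."""
    payload = {
        "schema_version": SCHEMA_VERSION,
        "annotations": [asdict(a) for a in annotations],
    }
    return json.dumps(payload, indent=2)

def from_json(text):
    """Return (schema_version, annotations) from a JSON export."""
    payload = json.loads(text)
    records = [TemporalAnnotation(**r) for r in payload["annotations"]]
    return payload["schema_version"], records

def to_csv(annotations):
    """Serialize annotations as spreadsheet-style rows with a header line."""
    fields = ["video_id", "annotator", "label", "start_s", "end_s"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for a in annotations:
        writer.writerow(asdict(a))
    return buf.getvalue()
```

Recording a schema version with every export is one simple way to tell which revision of the data dictionary a set of labels was created under, which may help limit re-annotation when definitions change.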
Unfortunately, the ultimate annotation software has yet to be created. No publicly available software can annotate images and videos with spatial and temporal annotations, by multiple annotators for the same video, from a centralized video repository, with easy annotation export, and annotation assistance by AI models. Until such a tool is created, current users must either use nonpublic industry tools (if they have access to one) or the limited publicly available ones, considering the tradeoffs listed above.

Next steps forward
Given many of the challenges we have reviewed above, there is clearly a need to establish consensus around the development and use of surgical annotation. Efforts have been underway to bring together the surgical data science community with the goal of moving forward from concepts in data science to actions required for the translation to clinical investigation and ultimate application to patient care [39,40].
We highlighted the importance of clearly defining the clinical phenomena of interest in executing annotations of surgical video. Clearly outlining and defining the concepts that exist within and across operations in a manner that is accessible to all researchers is a key component of enabling multi-institutional, multidisciplinary research that can be compared, contrasted, or combined. OntoSPM is an ambitious project that aims to outline a core ontology for surgical process models to enable large scale research efforts across groups [17]. In 2020, the Society of American Gastrointestinal and Endoscopic Surgeons convened a consensus group of surgeons and engineers to draft recommendations on the annotation of minimally invasive surgical videos, including both spatial and temporal annotations, the results of which are pending publication.
It is important to consider, however, that some annotator variability may not be the fault of poorly defined phenomena. Rather, such variability may reflect the 'fuzzy' nature of a phenomenon itself [41]. Even experienced surgeons may differ in their conceptualization of some phenomena, such as safe and unsafe zones of dissection or identification of specific anatomic structures [16,42,43]. Thus, combining annotations to serve as a fuzzier ground truth or to establish thresholds of agreement as ground truth may serve to either enhance modeling of clinical phenomena that are, by nature, fuzzy or provide a more realistic benchmark for model performance (i.e. to compare agreement of a model to multiple annotators vs. a single annotator) [11,16,44].
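The idea of combining annotations into a fuzzier ground truth, or of using a threshold of agreement, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of any cited study: each annotator is assumed to supply a binary mask of the same shape (nested lists of 0 and 1), and the function names `soft_label` and `threshold_label` and the default threshold of 0.5 are hypothetical.

```python
def soft_label(masks):
    """Average binary masks from several annotators.

    Each mask is a list of rows of 0/1 values. The result has the same
    shape and holds, per pixel, the fraction of annotators who marked it.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all masks must share the same shape")
    n = len(masks)
    return [
        [sum(m[r][c] for m in masks) / n for c in range(cols)]
        for r in range(rows)
    ]

def threshold_label(soft, min_agreement=0.5):
    """Binarize a soft label: keep pixels marked by at least
    `min_agreement` (a fraction between 0 and 1) of the annotators."""
    return [[1 if v >= min_agreement else 0 for v in row] for row in soft]
```

For example, with three annotators, `threshold_label(soft_label(masks), 0.5)` keeps pixels marked by at least two of them, while a threshold of 1.0 keeps only pixels on which all three agree. The soft label itself can instead be retained as a training target when the phenomenon is fuzzy by nature.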
One must also consider the downstream biasing effects of data and annotations. All models, ML or not, are inherently biased: they are simplified, compressed representations of reality learned from limited information. Even unsupervised learning models that do not use annotations have bias, as they learn from an inevitably unrepresentative subset of surgical videos. Tremendous thought must be put into building diverse, representative datasets, not just those from 'perfect' cases. Similar care must be given to defining widely applicable annotation labels. We do caution that, even with the best of efforts, these models will be biased. The study of the effects of bias is an active and critical research area that will help ensure the fair and effective deployment of AI into the operating room.
Finally, some elements of annotation of surgical video remain to be clearly defined (e.g. clinically meaningful events ranging from bleeding to bowel injury to retraction and exposure). While these types of events can be defined internally within a given study, scaling research efforts to enable translation to clinical practice will require at least some consensus on how such events should be annotated. Partnership between surgical data scientists, practicing surgeons, and health services and surgical education researchers could yield fruitful discussion and consensus on how to handle these types of events to enable consistent annotation across fields.

Conclusions
The rigorous application of surgical video annotation will be important to further advance the field of surgical data science, particularly as it relates to research on development of computer vision applications. In designing research to develop and validate such applications, researchers should consider carefully the specific phenomena of interest to determine whether ground truth annotations appropriately represent those phenomena or whether they represent alternative phenomena outside the scope of interest. Additional work will be required to build consensus across disciplines on annotation of clinically meaningful events and surgical performance, as these concepts span disciplines ranging from surgical data science to surgical education and health services research. Consensus efforts across disciplines offer an opportunity to impact a wider scope of work beyond automated surgical video analysis.