Rotation Correction Method Using Depth-Value Symmetry of Human Skeletal Joints for Single RGB-D Camera System

Most red-green-blue and depth (RGB-D) motion-recognition technologies employ both depth and RGB cameras to recognize a user's body. However, motion-recognition solutions using a single RGB-D camera struggle with rotation recognition depending on the device-user distance and field-of-view. This paper proposes a near-real-time rotational-coordinate-correction method that rectifies a depth error unique to Microsoft Kinect by using the symmetry of the depth coordinates of the human body. The proposed method is most effective within 2 m, a key range in which the unique depth error of Kinect occurs, and is anticipated to be utilized in applications requiring low cost and fast installation. It could also be useful in areas such as media art that involve unspecified users because it does not require a learning phase. Experimental results indicate that the proposed method has an accuracy of 85.38%, which is approximately 12% higher than that of the reference installation method.


INTRODUCTION
Concomitant with technological advances, focus has been placed on making human-computer interactions (HCIs) simple, rapid, and easy to understand [1]. In particular, human tracking was developed as a control method in the video game industry [2]. Advances in human tracking have led to the development of gesture interfaces and, subsequently, realization of the concept of a natural user interface [3], which has been studied since 1971.
Recently, with the development of infrared (IR) sensors and various image-processing techniques, the focus has been on devices that do not need to be mounted on the human body. One example is Kinect, a motion- and voice-recognition device based on the combination of red-green-blue (RGB) images and single depth images [4]. Kinect, originally developed as a game controller, provides good recognition rates despite using an inexpensive sensor. Soon after its launch, Kinect was hacked, which prompted Microsoft to patch the vulnerability and release the Kinect software development kit (Kinect SDK).
Fig. 1. (Left) Example of an application system using a Microsoft Kinect device, which allows users to interact directly with both virtual and real objects in industrial Internet of things (IIoT) areas [5]. (Right) Example of an indoor media art installation based on a Kinect Pixel Mirror [6].
Kinect was soon applied in various fields, including the arts [7][8][9][10], industries [11][12][13][14][15][16][17][18][19][20], and healthcare [21][22][23]. It is widely used in various fields where intuitive HCI is required, as shown in Fig. 1, because it is a low-cost and relatively accurate motion-recognition device. However, in contrast to other motion-capture devices, Kinect creates 3D coordinates using a single 2D RGB image to determine the x, y coordinates and a single depth image to determine the z coordinate [2]. Therefore, to utilize Kinect, the user, Kinect, and screen must face each other, as shown in Fig. 1. Consequently, Kinect has installation limitations, and the recognition rate is significantly lower if they are violated. The limitations are caused by problems with rotation recognition, which cause a depth error unique to Kinect, and the requirement that Kinect, the user, and the screen must be aligned in a straight line. In addition, the original application of Kinect, indoor video game playing, can also be hindered if the user does not have sufficient space [2].
A fixed-coordinate-correction method [24], multiple cameras [25][26][27], and Kinect fusion [28] have been proposed as correction methods to overcome the installation limitation and decreased recognition rate. However, these methods employ a polar-coordinate system, which differs from the x, y, z coordinate system used in the Microsoft Kinect SDK. Although Microsoft released Kinect v2 for Windows and Xbox One, which addressed the recognition-rate problem to an extent [29], its installation locations are also limited [27,30]. Moreover, various coordinate-correction methods that use cumulative data for the same learning method or converted coordinate systems have been studied [13][31][32][33][34] with the same test sample. These methods, based on machine learning or multiple sensors, exhibit high accuracy (between 83% and 93%) but are not suitable for some application fields where machine learning is difficult to apply. As shown in Fig. 1, fields using motion-recognition interactions, such as IIoT and media art, require high accessibility, easy and fast installation, and rapid prototyping, which cannot be achieved with these machine learning-based coordinate-correction methods [35].
In this paper, we propose a near-real-time rotation-correction method using the depth-value symmetry of human skeletal joints with the unique Kinect coordinate system in a single RGB-D camera system. The proposed method can be quickly deployed in the existing Kinect applications without converting the unique Kinect coordinate system, and it enables quick installation, improved accessibility, and rapid prototyping by correcting coordinates without using machine learning or a sensor network. Furthermore, the proposed method is calibrated using only the upper-body coordinate features of the human body, and it can reduce the throughput and support an upper-body-only mode in Kinect for Windows. In addition, the proposed method is intended for correction at distances of 1-2 m, where the unique depth error of Kinect occurs [28]. An experiment was conducted for a distance of 3 m to demonstrate the reduction of the unique depth error of Kinect. The measured results were characterized by their standard deviation, standard error rate, and average error rate.

RELATED WORK
2.1 Motion Recognition Based on HCI
Gesture recognition without a wearable device is enabled by image recognition-based HCI, which helps the user adapt quickly and makes the system easily accessible [29]. Although a case of inputting the hand motion of a user via a mouse using general RGB cameras has been reported, it is difficult to separate user movements from background images in complex environments [36]. To overcome this issue, Kinect detects objects using both RGB and depth images, as shown in Fig. 2. The simultaneous use of both images facilitates the recognition of people and backgrounds even under poor lighting. As shown in Fig. 3a, Kinect uses a rectangular coordinate system for the x and y values, which are detected using an RGB camera. In contrast, it uses a spherical coordinate system for the z values (depths) of the background and users, which are separately detected using an IR radiation device and IR camera, respectively [2]. However, given that depth recognition using IR reflectance is based on real-world coordinates, we chose to use the polar-coordinate system, as shown in Fig. 3b. The unique combined coordinate system of Kinect is excellent in single-plane applications for tracking objects in front of it [2], which is why Kinect can provide good-quality 3D motion tracking as a single low-cost device. However, according to related research [2,24,30,37], it produces numerous errors when tracking users who are rotated at angles with respect to the Kinect, rather than facing it directly. Methods such as multi-tracking and coordinate-system transformation have been used to overcome this disadvantage [24][25][26][27]. The polar-coordinate transformation method requires high-level understanding of the unique Kinect coordinate system, several computation processes [37][38][39], and an accurate installation angle [40].
A rotational transformation in Cartesian coordinates is expressed as follows: As shown in Eq. (1) and Fig. 4, the rotational transformation requires the rotation angle θ. According to Kumar et al. [44], the Kinect coordinates do not include θ. Thus, θ must be derived from two different 3D transformations, making intuitive installation and fast user interaction difficult.
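To make the dependence on θ concrete, the following minimal sketch applies the standard rotation about the vertical axis to a tracked joint. The choice of the y axis as the rotation axis, the sign convention, and the name `rotate_about_y` are assumptions made for illustration. They are not taken from the Kinect SDK and are not necessarily the form used in Eq. (1).

```python
import math


def rotate_about_y(x, y, z, theta_rad):
    """Rotate the point (x, y, z) about the vertical (y) axis by theta_rad.

    Assumed convention: right-handed axes, positive angles turn +z toward +x.
    """
    cos_t, sin_t = math.cos(theta_rad), math.sin(theta_rad)
    x_rot = x * cos_t + z * sin_t
    z_rot = -x * sin_t + z * cos_t
    return x_rot, y, z_rot  # height is unchanged by a rotation about y
```

The function cannot be evaluated without `theta_rad`, which is exactly the quantity that the raw Kinect coordinates do not provide.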
In the case of multi-Kinect tracking [24][25][26], the use of more than one Kinect sensor network will result in further space constraints. Although this method is highly accurate, the relative positions of each pair of sensors must be accurate, and it is difficult to obtain the exact position of a Kinect device if it is reinstalled or if the user changes.

Combined Coordinate System
Kinect uses a unique combined 3D coordinate system (Fig. 5) created by combining the values obtained in two different coordinate systems [24]. The z value of this coordinate system is obtained using the Euclidean distance from the origin in 3D spherical coordinates. This value is combined with 2D rectangular coordinates to obtain the 3D coordinates. As shown in Fig. 6, this unique coordinate system has a depth error that depends on the distance between the device and the user as well as the field-of-view (FOV). This depth error limits the installation and utilization of a single Kinect. The user, Kinect, and screen must be in a straight line to achieve accurate results.
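The origin of this depth error can be illustrated with a simplified geometric model. The sketch below assumes that the reported depth is the Euclidean distance from the sensor origin while x and y stay rectangular; the function names and the 0.2 m lateral offset are illustrative choices, not values from the Kinect SDK.

```python
import math


def combined_depth(x, y, z_plane):
    """Radial distance from the sensor origin to the point (x, y, z_plane).

    Assumption: this radial distance is what the sensor reports as depth.
    """
    return math.sqrt(x * x + y * y + z_plane * z_plane)


def depth_error(x, y, z_plane):
    """Difference between the radial depth and the planar depth z_plane."""
    return combined_depth(x, y, z_plane) - z_plane


if __name__ == "__main__":
    # 0.2 m is an illustrative lateral offset of a joint from the optical axis.
    for z in (1.0, 2.0, 3.0):
        print(f"z = {z:.1f} m, error = {depth_error(0.2, 0.0, z):.4f} m")
```

In this model, a joint on the optical axis has no error, and for a fixed lateral offset the error shrinks as the user moves away from the device, which is consistent with the error being most pronounced at short range.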
The unique combined coordinate system of Kinect is used despite these disadvantages because it does not require wearable equipment or remote sensors. Thus, the equipment size can be reduced while maintaining rapid and accurate measurements in limited installations, unlike existing tracking equipment such as RGB cameras, motion trackers, and controllers.

Tracking and Application Method of Kinect
According to the patented in-house depth-camera calibration of Microsoft [2], the basic tracking method involves the placement of equipment at least 2 m in front of a user, as shown in Fig. 7a. This tracking method does not have a shade area for most human motions and is recommended to obtain the best recognition rate when using Kinect. However, Kinect has a significant limitation in terms of its installation environment and usage because the Kinect, screen, and user must be aligned in a straight line [2,11].
An alternative tracking method involves installing equipment at a right angle (90°) to the line connecting the user and screen, as shown in Fig. 7b [2,11]. This method can overcome a few installation limitations. However, it does not overcome the depth error shown in Fig. 6.

Gesture Interface Recognition and Subject Rotation Recognition
To facilitate gesture recognition at various angles using Kinect, various methods have been applied, such as multilayer perceptrons, artificial neural networks, and support vector machines (SVMs) [31][32][33][34][44]. Tab. 1 summarizes the approaches, datasets, and accuracy of the previous studies that have attempted calibration using alternative methods. Numerous studies have been conducted on rotation recognition, most of which involved the recognition of gestures or sign language. Monir et al. [32] measured standing and sitting positions at distances in the range 1.3-3.5 m. They found that full-body tracking at 2-2.5 m yielded the highest accuracy. Tracking is difficult if the distance is too large or too small, if tracking is performed for only a part of the body, or if a rotation exists between the subject and the device. In the studies listed in Tab. 1, the distance [19,31,33,34,44] or angle [19][31][32][33][34] between the subject and device was considered constant. Sign language recognition using hand rotation and gestures involves the use of a learning algorithm to compensate for the low initial recognition accuracy [33,34,44].

PROPOSED COORDINATE-CORRECTION METHOD
In this paper, we propose a rotation-correction solution for a single Kinect device based on the symmetry of the depth values of the human skeleton. The proposed solution focuses on the distance and rotation errors mentioned in Section 2 to enable fast installation and easy access to applications. It is also an initial-value-correction solution without a learning process, which is beneficial for applications involving unspecified users, such as media art or sculptures installed in public places.
The human body exhibits overall organic movements, rather than individual movements of one part or organ. However, some body parts, such as the shoulders of people without trauma or injury, usually lie symmetrically on the same depth plane. Therefore, we set a calibration reference point in the human body. Fig. 8a shows an example of a skeleton recognized by Kinect when the attention stance is tracked using the recommended front-installation method. The right side of Fig. 8a shows a simplified model of this skeleton, and the simplified model has the same depth values as the original model. Thus, the coordinate values can be expressed as follows:
Owing to the characteristics of Kinect, the tracked x, y, z coordinates are twisted more frequently when the measurement is performed with the user at a greater rotation, as shown in Fig. 8b. This behavior is caused by the unique combined coordinate system of Kinect, in which the 2D rectangular coordinates are superimposed with the depth values in 3D spherical coordinates, without using the rotation equations or the same reference point. Therefore, when a user rotates, Kinect does not recognize this rotation. Instead, it registers a narrowing of the human body [2].
We used linear proportionality constants for the shoulder-arm ratio, distance, and rotation to compensate for the rotational errors caused by these structural characteristics. When the user rotates, the coordinate values are as follows:
Our approach is based on the fact that when a certain part of the human body, such as the shoulders or arms, is scanned in the Kinect reference position, it exists symmetrically on the same depth plane [41,42]. The following equation can be derived from the data obtained using Eq. (2) to Eq. (4):
As expressed above, the opposite side is calibrated based on a depth value from one side (the left or right side) by using a proportionality equation, so that both sides are recognized as if they were symmetrically located at the same depth. Each coordinate modified into the same depth plane can be expressed as follows:
Eq. (6) and Eq. (7) express the depth values at the left and right sides, respectively, when the angles are modified by calibrating the opposite side. In the proposed method, Eq. (6) and Eq. (7) are used when the Kinect is located on the right and left sides of a subject, respectively. The correction should take the shoulder of the user that is closer to the device as the reference point. Otherwise, as shown in Fig. 6, the method is affected by the depth error due to the distance and FOV, which results in inverse correction and increases the error rate.
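The calibration idea can be sketched in code as follows. This is a minimal illustration under strong assumptions, not a reproduction of Eq. (5) to Eq. (7): it uses a single proportionality ratio taken from the two shoulder depths and applies it to the far-side joints, whereas the proposed method also involves constants for the shoulder-arm ratio, distance, and rotation. The `Joint` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Joint:
    """A tracked skeletal joint; z is the sensor-reported depth in metres."""
    x: float
    y: float
    z: float


def split_reference(left_shoulder, right_shoulder):
    """Return (reference, far): the shoulder closer to the sensor is the reference."""
    if left_shoulder.z <= right_shoulder.z:
        return left_shoulder, right_shoulder
    return right_shoulder, left_shoulder


def correct_far_side(reference, far_shoulder, far_joints):
    """Scale far-side depths so the far shoulder lies on the reference depth plane.

    Assumption: one linear ratio (reference depth / far-shoulder depth) is
    applied to every far-side joint; x and y are left unchanged.
    """
    if far_shoulder.z <= 0:
        raise ValueError("far-shoulder depth must be positive")
    ratio = reference.z / far_shoulder.z
    return [Joint(j.x, j.y, j.z * ratio) for j in far_joints]
```

Selecting the nearer shoulder in `split_reference` mirrors the requirement that the shoulder closer to the device serve as the reference point; swapping the two would scale depths in the wrong direction, which corresponds to the inverse correction described above.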

EXPERIMENT
4.1 Experimental Method and System Setup
In this experiment, we used Kinect v1 because it is the most popular RGB and depth (RGB-D) motion-capture device and is used in various fields [36][43][44][45]. The unique depth error rate of Kinect has been reduced in Kinect v2 [29]. However, the resolution of the Kinect v2 IR camera is 512 × 424 pixels, which is not directly proportional to the 1920 × 1080 pixel resolution of its RGB camera. The resolution of the RGB camera is common, but that of the IR camera is uncommon, which can make it difficult for end users to gain easy access to development and application [29].
Wasenmüller and Stricker [42] reported that Kinect v2 has a relatively constant depth-measurement accuracy compared to Kinect v1. However, it is affected by temperature, color, and multipath interference effects. It is also difficult to improve image recognition through algorithms. They also explained that devices based on Kinect v1 are suitable for fast access to applications and rapid prototyping.
Other 3D skeleton capture devices based on Kinect technology, such as XTION and XTION2 from ASUS and PrimeSense, use IR cameras with common resolutions of 640 × 480 pixels for depth recognition, similar to Kinect v1 [2,40]. The basic principle of these devices is the same as that of Kinect v1. Therefore, investigations using Kinect v1 can be expected to provide solutions that are compatible with other devices, such as XTION.
Fig. 9 shows a block diagram of the program we developed for measurement. This program simultaneously measures the coordinates corrected by the proposed method and the uncorrected coordinates. It caches data every 500 ms and outputs and saves the data to a file upon completion. Fig. 10 presents a screenshot of the interface of the coordinate measurement program for simultaneously measuring the original and calibrated coordinates. The experiment was performed in an environment in which obstacles did not obstruct the body of the subject. We used a location in which objects were placed naturally, such as a room, because we inferred that the sensor might cause errors between the subject and neighboring objects. In addition, we performed the experiment in an environment in which shadows were not directed toward the sensor by illuminating the environment using an LED light from above.
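The caching behavior of the measurement program can be sketched as follows. Only the 500 ms interval, the pairing of raw and corrected coordinates, and the write-on-completion step come from the description above; the class name `PairedRecorder`, the CSV layout, and the column names are assumptions made for illustration.

```python
import csv
import io

SAMPLE_INTERVAL_MS = 500  # caching interval stated in the text


class PairedRecorder:
    """Caches raw and corrected coordinates side by side at a fixed interval."""

    HEADER = ["t_ms", "raw_x", "raw_y", "raw_z", "cor_x", "cor_y", "cor_z"]

    def __init__(self, interval_ms=SAMPLE_INTERVAL_MS):
        self.interval_ms = interval_ms
        self.rows = []
        self._last_ms = None

    def offer(self, t_ms, raw_xyz, corrected_xyz):
        """Cache the sample if the interval has elapsed; return True if cached."""
        if self._last_ms is not None and t_ms - self._last_ms < self.interval_ms:
            return False
        self.rows.append([t_ms, *raw_xyz, *corrected_xyz])
        self._last_ms = t_ms
        return True

    def to_csv_text(self):
        """Serialize all cached rows, header first."""
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(self.HEADER)
        writer.writerows(self.rows)
        return buffer.getvalue()

    def save(self, path):
        """Write the cached rows to a file once the measurement is complete."""
        with open(path, "w", newline="") as handle:
            handle.write(self.to_csv_text())
```

Passing the timestamp into `offer` rather than reading a clock inside the class keeps the sketch independent of the sensor frame rate and makes the 500 ms rule easy to check in isolation.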

Experimental Results
In the experiments, we selected the following three basic postures for actions performed by seven different people: attention posture, hands-up posture, and hands-half-up posture. Measurements were performed for each posture at a distance of up to 2 m. This distance range has the most significant effect on the depth error. Owing to the nature of Kinect, the depth error does not occur at approximately 3 m (Fig. 6). Therefore, for demonstration, we performed measurements of only the attention posture at 3 m, where the error value sharply decreased.
In this experiment, the Kinect device was fixed at a location, and the subject was made to look to the front (A). Subsequently, the subject rotated until the angle between the Kinect and the subject became 90° (Fig. 11a, Fig. 11b and Fig. 11c). Next, the subject faced the front again (Fig. 11c, Fig. 11b, Fig. 11a). This procedure was followed to measure the calibration accuracy for a range of motion.

Standard Deviation Analysis
This section presents analyses of the raw and calibrated data. In Section 2, we described the depth error based on the subject rotation and FOV. The error rate increases with the subject distance and rotation angle [2]. Consequently, the correction rate decreases as the distance increases. Tab. 2 lists the standard deviations for the corrected and raw data as well as the improvement due to the correction or correction rate. According to the table, the highest correction rate occurs at 1 m, and the correction at 2 m is less than that at 1 m. In addition, there is almost no correction at 3 m. Experiments were performed in the rotation range 0°-90°, as mentioned in Section 4.2.
The standard deviation is a measure of how close the correction data are to the frontal standard. Fig. 12 shows the standard deviation of the experimental results. From left to right, the data correspond to the 1 m attention stance, 1 m hands-up stance, 2 m attention stance, 2 m hands-up stance, 2 m hands-half-up stance, and 3 m attention stance. The standard deviation of correction data is reduced because the data are corrected close to the frontal standard. The most-corrected states are the 1 m hands-up and 2 m hands-half-up stances. The 1 m hands-up stance shows the highest correction rate among the experimental results. In this stance, the average error decreases by 52% from approximately 36.47 to approximately 17.15, and the standard error decreases by 42% from approximately 7.7 to approximately 4.4. The 2 m hands-half-up stance shows the lowest correction rate among the experimental results. In this stance, the average error rate is 31%, corresponding to a correction rate of 23% with respect to the raw-data value of 40.7. Although the error rate of 31% is acceptable for applications that do not require accurate interaction, such as media art, it is insufficient for applications such as games, which require highly accurate input. Fig. 13 and Fig. 14 show the standard and average error rates, respectively. The error rate is the lowest at 6.96 in the 1 m attention stance and the highest at 31.1 in the 2 m hands-half-up stance. At distances greater than 3 m, which are beyond the recognition range of the device, both the raw and correction values are 37.66, and the proposed method did not correct the data. For simple gestures at 1 m and 2 m, it is possible to obtain meaningful data for motion recognition with an accuracy of 80%-90%. However, for complicated motion at distances greater than 2 m, there is a possibility that motion recognition is not performed correctly, because the accuracy is 69%.
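The correction rates quoted above can be read as relative reductions, (raw − corrected) / raw. The sketch below works through that arithmetic with the approximate values reported in this section. The function names are illustrative, and treating accuracy as 100 minus the average error rate is an assumption inferred from the reported error rate of 31.1 and accuracy of 69% in the 2 m hands-half-up stance.

```python
def correction_rate(raw, corrected):
    """Relative reduction of an error value, in percent."""
    if raw <= 0:
        raise ValueError("raw value must be positive")
    return (raw - corrected) / raw * 100.0


def accuracy_from_error(error_rate_percent):
    """Accuracy taken as the complement of the average error rate (assumption)."""
    return 100.0 - error_rate_percent


if __name__ == "__main__":
    cases = [
        ("1 m hands-up, average error", 36.47, 17.15),
        ("1 m hands-up, standard error", 7.7, 4.4),
        ("2 m hands-half-up, average error", 40.7, 31.1),
    ]
    for label, raw, corrected in cases:
        print(f"{label}: {correction_rate(raw, corrected):.1f}%")
```

With these rounded inputs, the computed reductions are roughly 53.0%, 42.9%, and 23.6%, which agree with the reported 52%, 42%, and 23% if the percentages are truncated rather than rounded.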

Error Analysis
According to a related study on rotation calibration using an SVM [44], the accuracy is approximately 40% in the first attempt and approximately 71% after 50-60 iterations of machine learning. In the proposed method, the initial accuracy in the 2 m hands-half-up stance is 69%, which is similar to the accuracy of 71% in the results obtained after approximately 50 iterations of machine learning in the related study. However, the final correction rate is lower by approximately 10% because the proposed method does not involve post-processing.

CONCLUSION
In this paper, we introduced a rotation-correction method based on the depth-value symmetry of human skeletal joints in a single RGB-D camera system. The correction method utilizes the body specificity of the user and can easily be employed in RGB-D camera applications. The experimental results showed that the proposed framework is robust, with an overall accuracy of 85.38% and an average error rate of 14.62, which are better than the corresponding values obtained before the learning phase in most previous studies. However, the accuracy is 1.9%-7% lower than those in related studies in which long-term machine learning was applied.
Our method is advantageous for initial value correction in the short term, but its accuracy is lower than that of machine learning, which utilizes long-term and cumulative data. Nevertheless, the absence of a long-term learning phase makes our method useful for applications such as media art, in which users are unspecified. In the future, we will attempt to combine our method with machine learning. Because the initial calibration value of our solution is higher than the initial value obtained in related studies based on machine learning, we believe that the length of the learning phase can be reduced by using machine learning in combination with our method. Such a combination can potentially be applied as a short-term machine learning solution in areas such as smart industries, in which users can store and accumulate data.