Microsoft Kinect Sensor: Future Trends and Latest Research Challenges
Table of Contents
- Introduction
- Literature Survey
- Research Proposal
- Design Methodology
- Software: Dot Net Technology
- Practical Experiments
- Result and Discussion
- Conclusion

With the invention of the low-cost Microsoft Kinect sensor, depth sensing and high-resolution visual (RGB) imaging are now available for widespread use. In recent years, Kinect has gained popularity as a portable, inexpensive, markerless human motion capture device for which software is easy to develop. With these advantages and its advanced skeletal tracking capabilities, it has become an important tool for clinical assessment, physiotherapy and rehabilitation. This article gives an overview of the evolution of the different versions of Kinect and highlights the differences between their key features.

KEYWORDS: Computer vision, depth image, information fusion, Kinect sensor.

Introduction:
Capturing three-dimensional information about the geometry of objects or scenes is increasingly part of the conventional workflow for the documentation and analysis of cultural heritage and archaeological objects or sites. In this particular area of study, we can cite needs in terms of restoration, conservation, digital documentation, reconstruction and museum exhibitions [1,2]. The scanning process is now greatly simplified, with several available techniques providing 3D data [3]. For large spaces or objects, terrestrial laser scanners (TLS) are preferred because this technology collects a large amount of precise data very quickly. To reduce costs when working on smaller parts, digital cameras are commonly used instead; they have the advantage of being quite simple to use, thanks to image-based 3D reconstruction techniques [4]. Furthermore, the two methodologies can be merged in order to overcome their respective limitations and provide more comprehensive models [5,6].

Microsoft Kinect is a device originally designed to detect human movement, developed as a controller for the Xbox gaming console and sold since 2010. It did not take long for researchers to notice that its applicability extends beyond video gaming: it can serve as a depth sensor that enables interaction through gestures and body movements. In 2013, a new Kinect device, called Kinect v2 or Kinect for Xbox One, was introduced alongside the new gaming console. The new Kinect replaced the old technologies and brought many advances in quality and performance; the old Kinect was subsequently named Kinect v1 or Kinect for Xbox 360. Although it is classified as a depth camera, the Kinect sensor is much more than that. It contains several advanced sensing components, including a color camera, a depth sensor and a four-microphone array, which open up opportunities in 3D motion capture and in facial and voice recognition [5]. While Kinect for Xbox 360 uses a structured-light model to obtain a depth map of a scene, Kinect for Xbox One uses a faster and more accurate time-of-flight (ToF) sensor. Kinect's skeleton tracking features are used to analyze the movements of the human body for human-machine interaction, motion capture, human activity recognition and other fields; in addition, they are very useful for studies in physiotherapy and rehabilitation.
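As a minimal illustration of the skeleton tracking mentioned above, the sketch below reads tracked joint positions with the Kinect for Windows SDK 1.x (Microsoft.Kinect assembly). The event wiring follows the SDK; printing the head position to the console is only an illustrative use of the data.

```csharp
using System;
using System.Linq;
using Microsoft.Kinect; // Kinect for Windows SDK 1.x

class SkeletonDemo
{
    static void Main()
    {
        // Pick the first connected Kinect v1 sensor.
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return;

        sensor.SkeletonStream.Enable();
        sensor.SkeletonFrameReady += (s, e) =>
        {
            using (SkeletonFrame frame = e.OpenSkeletonFrame())
            {
                if (frame == null) return; // frame was dropped
                var bodies = new Skeleton[frame.SkeletonArrayLength];
                frame.CopySkeletonDataTo(bodies);
                foreach (Skeleton body in bodies.Where(
                    b => b.TrackingState == SkeletonTrackingState.Tracked))
                {
                    // Joint positions are in meters, relative to the sensor.
                    SkeletonPoint head = body.Joints[JointType.Head].Position;
                    Console.WriteLine($"Head at ({head.X:F2}, {head.Y:F2}, {head.Z:F2}) m");
                }
            }
        };
        sensor.Start();
        Console.ReadLine(); // keep the process alive while frames arrive
        sensor.Stop();
    }
}
```

This per-joint stream is the kind of data that physiotherapy and rehabilitation studies typically record and analyze.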
Kinect's time-of-flight (ToF) technology is also cost-effective, with potential application to verifying patient positioning in radiotherapy. In radiotherapy, the patient is initially positioned during the computed tomography (CT) simulation scan, which is then used to create a treatment plan. The plan is designed to deliver a tumoricidal dose to a planning target volume (PTV), which encompasses the gross disease with an additional margin to account for setup uncertainties. Once the treatment plan is approved, patients return for multiple fractions of treatment over a period of days or weeks. Reproducing precise patient positioning between fractions is essential to ensure accurate and effective delivery of the approved treatment plan. The motivation of this survey is to provide a comprehensive and systematic description of popular RGB-D datasets for the convenience of other researchers in this field.

Literature survey:
Motion capture and depth sensing have emerged as two active research areas in recent years. With the launch of Kinect in 2010, Microsoft opened the door for researchers to develop, test and optimize algorithms in both areas. Leyvand T [2] discussed Kinect technology, shedding light on how a person's identity is tracked by the Kinect sensor for Xbox 360 and on how the technology has changed over time. With the launch of Kinect, a step change in identification and tracking techniques was expected, and the authors discussed the likely challenges in the coming years for identification and tracking with games and Kinect sensors. Kinect identification is accomplished in two ways: biometric sign-in and session tracking. The authors assumed that players do not change their clothes or rearrange their hairstyle within a session, but do change their facial expressions, strike different poses, and so on. They saw the biggest challenge to Kinect's success as the accuracy factor, both in measurement and in regression. The main idea of the method is to take a single depth image and apply an object recognition approach: from a single input depth image, a per-pixel distribution over body parts is inferred. Depth imaging refers to computing the depth of each pixel alongside the RGB image data. The Kinect sensor provides real-time depth data in isochronous mode [18], so each frame of the depth stream must be processed in order to track movement correctly (a minimal per-frame processing sketch follows this section). The depth camera offers many advantages over a traditional camera: it can operate in low-light conditions and is color invariant [1]. Depth sensing can be performed either via time-of-flight laser sensing or via structured-light patterns combined with stereo sensing [9]. The proposed system uses the stereo detection technique provided by PrimeSense [21]. Kinect depth sensing works in real time with greater precision than other depth-sensing cameras available at the time. The Kinect depth camera uses an IR laser pattern to estimate the distance between the object and the sensor. The technology behind this system connects the CMOS image sensor directly to the system-on-chip [21]; in addition, a sophisticated decoding algorithm (not published by PrimeSense) is used to decode the raw depth data.
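Picking up the point above that every frame of the depth stream must be processed, the usual Kinect for Windows SDK 1.x pattern is an event handler on the depth stream. The API calls below are from that SDK; the mean-depth statistic at the end is only an illustrative per-frame computation.

```csharp
using System;
using Microsoft.Kinect; // Kinect for Windows SDK 1.x

static class DepthProcessing
{
    public static void Attach(KinectSensor sensor)
    {
        sensor.DepthStream.Enable(DepthImageFormat.Resolution640x480Fps30);
        sensor.DepthFrameReady += (s, e) =>
        {
            using (DepthImageFrame frame = e.OpenDepthImageFrame())
            {
                if (frame == null) return; // frame was dropped
                var pixels = new short[frame.PixelDataLength];
                frame.CopyPixelDataTo(pixels);

                long sum = 0; int count = 0;
                for (int i = 0; i < pixels.Length; i++)
                {
                    // The low bits carry the player index; shift them away
                    // to get the distance in millimeters.
                    int depthMm = pixels[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
                    if (depthMm > 0) { sum += depthMm; count++; }
                }
                if (count > 0)
                    Console.WriteLine($"Mean scene depth: {sum / count} mm");
            }
        };
    }
}
```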
Research Proposal:
Due to their appeal and imaging capabilities, much work has been devoted to RGB-D cameras over the past decade. The objective of this section is to present the state of the art of this technology, considering aspects such as areas of application, calibration methods and metrological approaches.

RGB-D cameras lend themselves to a wide range of application areas. Their main advantages are their cost, which is mostly low compared to laser scanners, and their high portability, which allows use on board mobile platforms. Regarding 3D modeling of objects with an RGB-D camera, the creation of 3D models is a common and attractive solution for the documentation and visualization of heritage and archaeological materials; owing to its remarkable results and affordable price, however, the technique most used by the archaeological community probably remains photogrammetry. Error sources and calibration methods constitute the main problem when working with ToF cameras, because the measurements are distorted by several phenomena. To guarantee the reliability of the acquired point clouds, particularly for precise 3D modeling, these distortions must first be removed, which requires a good knowledge of the multiple sources of error that affect the measurements.

Future Outlook:
Analyzing the above articles, we believe that much future work remains for this research community. Here we discuss potential ideas for each of the main vision topics separately.

Object tracking and recognition benefits from background subtraction based on depth images, which can easily solve practical problems that have hampered object tracking and recognition for a long time (a minimal sketch of this idea is given at the end of this section). It would not be surprising if small devices equipped with Kinect-like RGB and depth cameras appeared in normal office environments in the near future. However, the limited range of the depth camera may not allow its use for standard indoor surveillance applications. Combining multiple Kinects is a potential solution, which will of course require communication between the Kinects and re-identification of objects across different views.

For human activity analysis, reliable algorithms capable of estimating complex human poses (such as gymnastic or acrobatic poses) and the poses of people in close interaction will certainly remain active topics. For activity recognition, further research on low-latency systems may become the trend in this field, as more and more practical applications require online recognition.

Hand gesture analysis shows that many approaches avoid the problem of detecting hands in realistic situations by assuming that the hands are the objects closest to the camera. These methods are experimental and their use is limited to laboratory environments. In the future, methods that can handle arbitrary, high-degree-of-freedom hand movements in realistic situations may attract more attention. Additionally, there is a dilemma between shape-based and 3D-model-based methods: the former allow high-speed operation with a loss of generality, while the latter provide generality at a higher computational cost. The balance and compromise between them will therefore become an active topic.

According to the evaluation results of the most recent approaches, indoor 3D mapping fails when erroneous edges are created during mapping, so methods that can detect bad edges and repair them autonomously will be very useful in the future. In sparse feature-based approaches, it may be necessary to optimize the keypoint matching scheme, either by adding a feature lookup table or by eliminating non-matching features. In dense point-matching approaches, it is worth trying to reconstruct larger scenes, such as the interior of an entire building; here, more memory-efficient representations will be needed.
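To make the depth-based background subtraction mentioned above concrete, here is a minimal, SDK-independent sketch. It assumes a reference depth frame of the empty scene has already been captured; the function name and threshold are illustrative and not taken from any Kinect library.

```csharp
// Minimal depth-based background subtraction (illustrative sketch).
// 'background' holds a depth frame of the empty scene in millimeters;
// 'current' is the latest frame. A pixel is foreground when something
// has moved closer to the camera by more than 'thresholdMm'.
static bool[] SubtractBackground(ushort[] background, ushort[] current, int thresholdMm)
{
    var foreground = new bool[current.Length];
    for (int i = 0; i < current.Length; i++)
    {
        // A depth of 0 means "no reading" on both Kinect versions; skip it.
        if (current[i] == 0 || background[i] == 0) continue;
        foreground[i] = background[i] - current[i] > thresholdMm;
    }
    return foreground;
}
```

Because the test runs on distances rather than colors, it is unaffected by lighting and clothing color, which is precisely the advantage of the depth channel noted in the literature survey.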
Design methodology:
Our system implements augmented reality using the processing capabilities of Kinect. The system consists of four main components: the tracking device, the processing device, the input device and the display device. We use Kinect as the tracking device. It contains three sensors, for depth images, RGB images and voice; the depth camera and the multi-array microphone capture the image stream and the audio data in real time, respectively, and the depth sensor provides the distance between the sensor and the tracked object. The input device of our setup is a high-definition camera, which supplies the input image stream that runs in the background behind all augmented reality components. On top of this background stream, we overlay event-specific 3D models to provide the virtual reality experience. The processing device, consisting of the data processing unit, the audio unit and the associated software, decides which model to overlay at what time, and it transmits the input video stream and the 3D model to the display device for viewing.

The Kinect system plays an important role in the overall system, working as its tracking unit. It uses some of Kinect's most exciting features, such as skeletal tracking, joint estimation and voice recognition. Skeletal tracking determines the user's position relative to Kinect when the user is in frame, which is used to guide them through the assembly procedure; it also makes gesture recognition easier. The system guides the user through the complete assembly of a product using voice and gesture recognition. Product assembly involves bringing together the individual component parts and putting them together as a product. There are two assembly modes: Full Assembly and Part Assembly. In Full Assembly mode, Kinect guides the technician through assembling the entire product sequentially; this mode is useful when the whole product needs to be assembled. In Part Assembly mode, the technician selects a part to assemble, and Kinect guides them through assembling that part; when assembly of the part is complete, the technician can select another part or exit. This mode is useful when only one or a few parts need to be assembled.

The system operates in two modes, voice mode and gesture mode, and the choice of mode is left to the user based on their familiarity with the system and convenience of use. If the user opts for voice mode, they use voice commands to interact with the system, and the system guides them through voice prompts. If the user opts for gesture mode, they interact through gestures, while the system still guides them through voice prompts. The 'START' command is used in both modes to launch the system; after launching, the user selects voice mode or gesture mode and continues working accordingly. The resulting control flow is sketched in code below.
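The following is a hypothetical sketch of that control flow. The enum and command names are illustrative rather than part of any Kinect SDK, and the voice/gesture recognizers that would feed OnCommand are omitted.

```csharp
// Hypothetical control flow for the assembly-guidance system described above.
enum InteractionMode { None, Voice, Gesture }
enum AssemblyMode { None, Full, Part }

class AssemblyGuide
{
    InteractionMode interaction = InteractionMode.None;
    AssemblyMode assembly = AssemblyMode.None;

    // Called by a voice or gesture recognizer (not shown) with the
    // recognized command; 'START' launches the system in both modes.
    public void OnCommand(string command)
    {
        switch (command.ToUpperInvariant())
        {
            case "START":
                Prompt("Choose VOICE or GESTURE mode.");
                break;
            case "VOICE":
                interaction = InteractionMode.Voice;
                Prompt("Voice mode selected. Choose FULL or PART assembly.");
                break;
            case "GESTURE":
                interaction = InteractionMode.Gesture;
                Prompt("Gesture mode selected. Choose FULL or PART assembly.");
                break;
            case "FULL": // guide through the whole product sequentially
                assembly = AssemblyMode.Full;
                Prompt("Starting full assembly at step 1.");
                break;
            case "PART": // guide through one selected part, then loop or exit
                assembly = AssemblyMode.Part;
                Prompt("Select a part to assemble.");
                break;
            case "EXIT":
                Prompt("Session finished.");
                break;
        }
    }

    // Stand-in for the system's spoken guidance.
    void Prompt(string message) => System.Console.WriteLine(message);
}
```

A session would then be driven by calls such as `new AssemblyGuide().OnCommand("START")` from whichever recognizer the user chose.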
Software: Dot Net Technology
Hardware (Kinect): The Kinect sensor, the first low-cost depth camera, was introduced by Microsoft in November 2010. At first it was essentially a motion-controlled gaming device; a version for Windows was released later. In this section we discuss the evolution of Kinect from v1 to the recent version, v2.

Kinect v1: The Kinect for Windows release of Kinect v1 arrived in February 2012 and began to compete with several other motion controllers available on the market. Kinect's hardware consists of a sensor bar including a 3D depth sensor, an RGB camera, a multi-array microphone and a motorized tilt. The sensor enables full-body 3D motion capture, facial recognition and voice recognition. The depth sensor consists of an IR projector and an IR camera, which is a monochrome complementary metal-oxide-semiconductor (CMOS) sensor. The IR projector casts an IR laser through a diffraction grating, transforming it into a set of IR dots. The projected dots in the 3D scene are invisible to the color camera but visible to the IR camera, and the relative left-right translation of the dot pattern gives the depth of each dot.

Kinect v2: Kinect v1 was superseded by v2 in November 2013. The second-generation Kinect v2 is completely different, being based on ToF technology: a set of emitters sends a modulated signal, which propagates to the measured point, is reflected and is received by the sensor. The sensor acquires a 512×424 depth map and a 1920×1080 RGB image at a rate of 15 to 30 frames per second.

Kinect Software: Three libraries are commonly used for Kinect development: the Microsoft Kinect SDK, OpenNI and OpenKinect. OpenKinect is a free, open-source library maintained by an open community of Kinect enthusiasts, but the majority of users rely on the first two, namely the Microsoft SDK and OpenNI. The Microsoft SDK is only available for Windows, while OpenNI is a cross-platform, open-source tool. Microsoft offers its Kinect development library as a free download.

Practical Experiments:
Kinect, in this article, refers both to the advanced RGB/depth sensing hardware and to the software technology that interprets the RGB/depth signals. The hardware contains a normal RGB camera, a depth sensor and an array of four microphones, capable of providing depth signals, RGB images and audio signals simultaneously. On the software side, several tools are available that allow users to develop products for various applications. These tools provide features to synchronize image signals, capture 3D human motion, identify human faces and recognize human voice, among others. Human voice recognition is achieved here by remote (far-field) voice recognition techniques, thanks to recent advances in surround-sound echo cancellation and microphone-array processing; more details about Kinect audio processing can be found in the literature.
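As a counterpart to the v1 examples above, the sketch below reads a 512×424 depth frame with the Kinect for Windows SDK 2.0 (the Microsoft.Kinect assembly for Kinect v2). The reader/event pattern follows the SDK; printing the center-pixel distance is only an illustrative use of the frame.

```csharp
using System;
using Microsoft.Kinect; // Kinect for Windows SDK 2.0 (Kinect v2)

class DepthV2Demo
{
    static void Main()
    {
        KinectSensor sensor = KinectSensor.GetDefault();
        FrameDescription desc = sensor.DepthFrameSource.FrameDescription; // 512x424
        var depth = new ushort[desc.LengthInPixels];

        DepthFrameReader reader = sensor.DepthFrameSource.OpenReader();
        reader.FrameArrived += (s, e) =>
        {
            using (DepthFrame frame = e.FrameReference.AcquireFrame())
            {
                if (frame == null) return; // frame was dropped
                frame.CopyFrameDataToArray(depth); // distances in millimeters

                // Example use: distance at the image center.
                int center = (desc.Height / 2) * desc.Width + desc.Width / 2;
                Console.WriteLine($"Center pixel depth: {depth[center]} mm");
            }
        };

        sensor.Open();
        Console.ReadLine(); // keep the process alive while frames arrive
        sensor.Close();
    }
}
```

Unlike v1, no bit shifting is needed: v2 depth frames already deliver plain millimeter values per pixel.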