This is probably just simple 2D feature-based recognition (SIFT, SURF, etc.), but the problem of recognizing yourself is a deeper issue, related to self-awareness and theory of mind. Any creature that wants to survive needs to be able to do things such as detect damage to itself, and this requires a self-model which is learned through early interactions with the world, such as play behaviors, together with lifelong habituation. In humans, and presumably other animals too, this can lead to curious phenomena such as phantom limb syndrome. Currently, in systems such as ROS, the model of the robot is hand-coded by traditional engineering (typically as a URDF description), but in a hypothetical AGI system the model would be learned through self-observation and interaction with the environment.
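For concreteness, this is the kind of hand-written description I mean - a toy URDF fragment with invented names and dimensions, embedded in a Python string so it can be parsed and inspected:

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written URDF fragment of the sort ROS uses to describe
# a robot's body. The robot, links, and joint here are made up for
# illustration; real descriptions run to hundreds of lines.
MINIMAL_URDF = """\
<robot name="toy_bot">
  <link name="base_link">
    <visual>
      <geometry><cylinder radius="0.15" length="0.30"/></geometry>
    </visual>
  </link>
  <link name="head"/>
  <joint name="neck" type="revolute">
    <parent link="base_link"/>
    <child link="head"/>
    <axis xyz="0 0 1"/>
    <limit lower="-1.57" upper="1.57" effort="1.0" velocity="1.0"/>
  </joint>
</robot>
"""

# The point: every link and joint is declared up front by an engineer,
# rather than being discovered by the robot through its own experience.
root = ET.fromstring(MINIMAL_URDF)
print([link.get("name") for link in root.iter("link")])    # ['base_link', 'head']
print([joint.get("name") for joint in root.iter("joint")]) # ['neck']
```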
Qbo also uses stereo vision rather than an RGBD sensor. Stereo correspondence has been a classic problem in computer vision because of the high ambiguity of feature matching, but dense stereo algorithms have improved greatly in recent years, to the point where their results are comparable to what the Kinect produces, especially at close range.
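As a rough illustration, here's a minimal sketch of dense stereo matching using OpenCV's semi-global block matcher - one common open source implementation, not necessarily what Qbo runs. The filenames and parameter values are assumptions for the sake of the example:

```python
import cv2
import numpy as np

# Load a rectified stereo pair (assumed filenames; any rectified pair works)
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; parameters are typical starting values
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,        # disparity search range, must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,             # penalty for small disparity changes (smoothness)
    P2=32 * 5 * 5,            # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Normalise for display as a depth-map-style image
disp_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite('disparity.png', disp_vis)
```

Depth is recovered from disparity as focal length times baseline divided by disparity, which is why stereo accuracy is best at close range and falls off with distance.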
Example depth maps from stereo vision are shown in these videos - one of me, and another of a cup on a table. I don't know whether Qbo uses this type of algorithm, but the one shown is among the open source implementations available for use in robotics projects.