Swipeless Tinder Using iOS 14 Vision Hand Pose Estimation
Let’s use the power of computer vision to detect hand gestures in iOS
iOS 14 brought a slew of enhancements and interesting new features to Apple’s computer vision framework.
The Vision framework was released in 2017 to let mobile application developers leverage complex computer vision algorithms with ease. Specifically, the framework incorporates a host of pre-trained deep learning models whilst also acting as a wrapper to quickly run your own custom Core ML models.
After the introduction of Text Recognition and VisionKit in iOS 13 to boost OCR, Apple shifted its focus towards sports and action classification in iOS 14’s Vision framework.
Primarily, the Vision framework now lets you do Contour Detection and Optical Flow requests, and it includes a bunch of new utilities for offline video processing. But more importantly, we can now do Hand and Body Pose Estimation — which certainly opens the door for new possibilities in augmented reality and computer vision.
In this article, we’re focusing on Hand Pose estimation to build an iOS app that lets you perform touchless finger gestures.
If you’ve been following my pieces, I’ve already demonstrated how to Build a Touchless Swipe iOS App Using ML Kit’s Face Detection API. I felt that prototype was cool to integrate into dating apps like Tinder, Bumble, and more. But at the same time, it could cause eye strain and headaches due to all the blinking and head turning.
So, we’ll simply extend that use case by using hand pose gestures instead to swipe left or right — because in 2020, it's OK to be lazy and practice social distancing with our phones. Before we dive into the deep end, let’s look at how to create a Vision Hand Pose Request in iOS 14.
Vision Hand Pose Estimation
The new VNDetectHumanHandPoseRequest is an image-based Vision request that detects a human hand pose. It returns 21 landmark points on each hand in an instance of the type VNHumanHandPoseObservation. We can set the maximumHandCount to cap the number of hands detected in each frame during the Vision processing.
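Here’s a minimal sketch of creating and performing the request on a single frame; the pixel buffer is assumed to come from a camera or image source, and error handling is simplified for illustration:

import Vision

// Minimal sketch: create the request and run it on one frame.
let handPoseRequest = VNDetectHumanHandPoseRequest()
handPoseRequest.maximumHandCount = 1 // track at most one hand per frame

// `pixelBuffer` is assumed to come from the camera feed.
func detectHandPose(in pixelBuffer: CVPixelBuffer) throws -> VNHumanHandPoseObservation? {
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
    try handler.perform([handPoseRequest])
    return handPoseRequest.results?.first
}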
To get the points array for each finger, we simply call recognizedPoints with the finger’s joints-group enum on the observation instance in the following way:
try observation.recognizedPoints(.thumb)
try observation.recognizedPoints(.indexFinger)
try observation.recognizedPoints(.middleFinger)
try observation.recognizedPoints(.ringFinger)
try observation.recognizedPoints(.littleFinger)
There’s also a wrist landmark that’s located at the center of the wrist and is not part of any of the above groups. Instead, it falls in the all group and can be retrieved in the following way:
let wristPoints = try observation.recognizedPoints(.all)
Once we’ve got the above points array, we can extract the individual points in the following way:
guard let thumbTipPoint = thumbPoints[.thumbTip],
      let indexTipPoint = indexFingerPoints[.indexTip],
      let middleTipPoint = middleFingerPoints[.middleTip],
      let ringTipPoint = ringFingerPoints[.ringTip],
      let littleTipPoint = littleFingerPoints[.littleTip],
      let wristPoint = wristPoints[.wrist] else { return }
thumbIP, thumbMP, and thumbCMC are the other individual points that you can retrieve from the thumb’s point group (and so on for the other fingers).

Each of the individual point objects contains its location in a normalized Vision coordinate space along with a confidence value.
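For example, here’s a quick sketch of reading those values and skipping low-confidence detections; the 0.3 cut-off is an illustrative choice, not a value from Apple’s sample:

// thumbTipPoint and wristPoint are the VNRecognizedPoints unwrapped above.
// Ignore the frame if either landmark isn't detected confidently enough.
guard thumbTipPoint.confidence > 0.3, wristPoint.confidence > 0.3 else { return }

// location is a CGPoint in a normalized (0...1) coordinate space.
let normalizedThumbTip = thumbTipPoint.location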
Subsequently, we can find distances or angles between points to create certain gesture processors. For instance, in Apple’s demo application, they’ve created a pinch gesture by calculating the distance between thumb and index tip points.
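As a rough sketch of that idea, a pinch check over two already-converted screen points could look like this; the 40-point threshold is an arbitrary value, not Apple’s:

import CoreGraphics

// Returns true when the two fingertips are close enough to count as a pinch.
// The threshold is illustrative and would need tuning per device and use case.
func isPinching(thumbTip: CGPoint, indexTip: CGPoint, threshold: CGFloat = 40) -> Bool {
    let distance = hypot(thumbTip.x - indexTip.x, thumbTip.y - indexTip.y)
    return distance < threshold
}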
Getting Started
Now that we’re done with the basics of Vision Hand Pose Request, let's dive into the implementation.
Launch Xcode and create a new UIKit application. Make sure you’ve set the deployment target to iOS 14 and added the NSCameraUsageDescription string to the Info.plist.

Since we’ve already covered how to create Tinder-esque cards with animation, here’s the final code for that class.
Similarly, here’s the code for the StackContainerView.swift class that holds the stack of Tinder cards.
Setting Up Our Camera Using AVFoundation
Next up, let’s create our own custom camera using Apple’s AVFoundation framework.
Here’s the code for the ViewController.swift file:
There’s a lot happening in the above code. Let’s break it down.
- CameraView is a custom UIView class that displays the camera contents on the screen. We’ll come to it shortly.
- setupAVSession() is where we set up the front-facing camera and add it as the input to the AVCaptureSession (a condensed sketch of this setup follows the list).
- Subsequently, we invoke setSampleBufferDelegate on the AVCaptureVideoDataOutput.
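The complete ViewController.swift is in the project’s source code; the sketch below condenses the camera setup described above. Property names such as cameraFeedSession and videoDataOutputQueue are assumptions for illustration:

import AVFoundation
import UIKit

// Condensed sketch of the camera setup. Property names (cameraFeedSession,
// videoDataOutputQueue) are assumptions for illustration.
class ViewController: UIViewController {

    private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedOutput", qos: .userInteractive)
    private var cameraFeedSession: AVCaptureSession?

    func setupAVSession() {
        // Pick the front-facing camera and wrap it in a device input.
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front),
              let deviceInput = try? AVCaptureDeviceInput(device: device) else { return }

        let session = AVCaptureSession()
        session.beginConfiguration()
        session.sessionPreset = .high

        // Add the front camera as the session's input.
        if session.canAddInput(deviceInput) {
            session.addInput(deviceInput)
        }

        // Add a video data output and register this controller as its sample buffer delegate.
        let dataOutput = AVCaptureVideoDataOutput()
        if session.canAddOutput(dataOutput) {
            session.addOutput(dataOutput)
            dataOutput.alwaysDiscardsLateVideoFrames = true
            dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
        }

        session.commitConfiguration()
        session.startRunning()
        cameraFeedSession = session
    }
}

// The delegate conformance and captureOutput(_:didOutput:from:) live in an
// extension, shown later in the article.
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {}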
The ViewController class conforms to the HandSwiperDelegate protocol:
protocol HandSwiperDelegate {
func thumbsDown()
func thumbsUp()
}
We’ll trigger the respective method when the hand gesture is detected. Now, let’s look at how to run a Vision request on the captured frames.
Running Vision Hand Pose Request On Captured Frames
In the following code, we’ve created an extension of our above ViewController, which conforms to AVCaptureVideoDataOutputSampleBufferDelegate:
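The full extension is in the project’s source code; here’s a trimmed sketch of the idea. The handPoseRequest and cameraFeedSession properties and the processPoints function are assumed to be defined on the ViewController:

import AVFoundation
import Vision

// Trimmed sketch of the delegate extension. handPoseRequest, cameraFeedSession
// and processPoints(_:) are assumed to exist on ViewController.
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])
        do {
            // Run the hand pose request on the current frame.
            try handler.perform([handPoseRequest])
            guard let observation = handPoseRequest.results?.first else { return }

            // Pull out the thumb tip and wrist landmarks.
            let thumbPoints = try observation.recognizedPoints(.thumb)
            let wristPoints = try observation.recognizedPoints(.all)
            guard let thumbTipPoint = thumbPoints[.thumbTip],
                  let wristPoint = wristPoints[.wrist] else { return }

            // Flip the y-axis to go from Vision coordinates to AVFoundation coordinates.
            let thumbTip: CGPoint? = CGPoint(x: thumbTipPoint.location.x, y: 1 - thumbTipPoint.location.y)
            let wrist: CGPoint? = CGPoint(x: wristPoint.location.x, y: 1 - wristPoint.location.y)

            DispatchQueue.main.async {
                self.processPoints([thumbTip, wrist])
            }
        } catch {
            cameraFeedSession?.stopRunning()
        }
    }
}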
It’s worth noting that the points returned by the VNObservation belong to the Vision coordinate system. We need to convert them to UIKit coordinates to eventually draw them on the screen.
So, we’ve converted them into the AVFoundation coordinate system in the following way:
wrist = CGPoint(x: wristPoint.location.x, y: 1 - wristPoint.location.y)
Subsequently, we’ll pass these points to the processPoints function. For the sake of simplicity, we’re using just two landmarks — thumb tip and wrist — to detect the hand gestures.
Here’s the code for the processPoints function:
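The full implementation is in the project’s source code; below is a sketch of the logic under a few assumptions: cameraView exposes its preview layer, the ViewController itself implements thumbsUp() and thumbsDown(), and the distance thresholds (and their mapping to the two gestures) are illustrative values rather than the exact numbers used in the project:

import UIKit

// Sketch of processPoints. cameraView.previewLayer, thumbsUp()/thumbsDown()
// and the threshold values are assumptions for illustration.
extension ViewController {

    func processPoints(_ points: [CGPoint?]) {
        // Convert each AVFoundation-space point into UIKit coordinates
        // using the camera view's preview layer.
        let previewLayer = cameraView.previewLayer
        let pointsConverted = points.map {
            previewLayer.layerPointConverted(fromCaptureDevicePoint: $0!)
        }

        guard let thumbTip = pointsConverted.first, let wrist = pointsConverted.last else { return }

        // Absolute distance between the thumb tip and the wrist, in screen points.
        let distance = hypot(thumbTip.x - wrist.x, thumbTip.y - wrist.y)

        // Illustrative mapping: a large distance triggers a right swipe (thumbs up),
        // a small one triggers a left swipe (thumbs down).
        if distance > 270 {
            thumbsUp()
        } else if distance < 90 {
            thumbsDown()
        }

        // Draw a line between the two converted points on the camera view.
        cameraView.showPoints(pointsConverted)
    }
}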
The following line of code converts a point from the AVFoundation coordinate system to UIKit coordinates:
previewLayer.layerPointConverted(fromCaptureDevicePoint: point!)
Finally, based on the absolute threshold distance between the two points, we trigger the respective left swipe or right swipe action on the stack of cards.
cameraView.showPoints(pointsConverted) draws a line between the two points on the CameraView sublayer.
Here’s the full code of the CameraView class:
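Since the complete class is in the project’s source code, here’s a sketch of the essentials: a view backed by an AVCaptureVideoPreviewLayer plus a showPoints method that draws an overlay line. The overlay styling is an assumption:

import AVFoundation
import UIKit

// Sketch of the CameraView described above: an AVCaptureVideoPreviewLayer-backed
// view with an overlay layer for drawing the detected points. Styling values
// are assumptions.
class CameraView: UIView {

    private let overlayLayer = CAShapeLayer()

    // Back the view with a preview layer so the camera feed fills it.
    override class var layerClass: AnyClass {
        AVCaptureVideoPreviewLayer.self
    }

    var previewLayer: AVCaptureVideoPreviewLayer {
        layer as! AVCaptureVideoPreviewLayer
    }

    override init(frame: CGRect) {
        super.init(frame: frame)
        setupOverlay()
    }

    required init?(coder: NSCoder) {
        super.init(coder: coder)
        setupOverlay()
    }

    private func setupOverlay() {
        overlayLayer.strokeColor = UIColor.green.cgColor
        overlayLayer.lineWidth = 5
        overlayLayer.fillColor = UIColor.clear.cgColor
        previewLayer.addSublayer(overlayLayer)
    }

    override func layoutSubviews() {
        super.layoutSubviews()
        overlayLayer.frame = bounds
    }

    // Draw a line between the two converted points on the overlay sublayer.
    func showPoints(_ points: [CGPoint]) {
        guard points.count >= 2 else { return }
        let path = UIBezierPath()
        path.move(to: points[0])
        path.addLine(to: points[1])
        overlayLayer.path = path.cgPath
    }
}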
Final Output
The output of the application in action is given below:

Conclusion
From gesture-based selfie clicks to drawing signatures to finding the different hand gestures people make in videos, there are so many ways you can take advantage of Vision’s new Hand Pose Estimation request.
One could also chain this Vision request with, say, a body pose request, to build complex gestures.
The full source code of the above project is available in the GitHub Repository.
That’s it for this one. Thanks for reading!