iOS 14 Vision Body Pose Detection: Count Squat Reps in a SwiftUI Workout App
Build an AI fitness SwiftUI app using Apple’s Vision framework to detect human body poses
In this article, we’ll explore a mobile ML use case and build a practical application around Apple’s VNDetectHumanBodyPoseRequest. This pre-trained model was presented during WWDC20 and is a powerful tool for the real-time detection of body points.
During the pandemic, the at-home fitness industry has been booming, so our app will be a kind of home workout mirror that counts squat repetitions and provides useful cues about our posture.

Before we dive into the code, let’s spend a moment on the model we’re using for this task. VNDetectHumanBodyPoseRequest returns the 2D coordinates of 19 different body points in a given image. We’ll implement this request in a SwiftUI app and then build a logic-based approach that looks for changes in our body position.

You might be wondering why we aren’t using the Create ML Action Classifier instead. We could capture a few squat videos and Create ML would train a custom model with the body positions as features. That would be enough to tell whether we’re squatting. The problem is that it requires a fixed-length video input: if, for example, we set the capture length to three seconds, we would get a prediction for that time window but no count of the squats. So instead, we’ll build our own small logic layer on top of VNDetectHumanBodyPoseRequest.
Go ahead and create a new SwiftUI project. The first step is to access the front camera of the iPhone. As SwiftUI doesn’t come with direct camera access, we will use a UIKit view controller with an AVCaptureSession. I won’t go into too much detail for this step, as there are already countless tutorials on capturing camera output in SwiftUI. In our app, we also want to show the live preview from the camera, so we make the UIViewController manage a UIView whose root layer is of type AVCaptureVideoPreviewLayer.
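One way to sketch such a view is to override layerClass so the preview layer becomes the view’s root layer (the class name CameraPreviewView is just a placeholder of mine):

```swift
import UIKit
import AVFoundation

// A minimal sketch: a UIView whose backing layer is an AVCaptureVideoPreviewLayer,
// so the live camera feed fills the view without managing sublayer frames manually.
final class CameraPreviewView: UIView {

    // Make AVCaptureVideoPreviewLayer the view's root layer.
    override class var layerClass: AnyClass {
        AVCaptureVideoPreviewLayer.self
    }

    var previewLayer: AVCaptureVideoPreviewLayer {
        layer as! AVCaptureVideoPreviewLayer
    }
}
```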
In the viewDidAppear call of the view controller, we initialize the AVCaptureSession and set the front camera as input. As our model works on single images, we need to grab sample frames from the video output. This is done via the SampleBufferDelegate, which is also our handover point to SwiftUI.
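A sketch of that setup could look like this. It builds on the CameraPreviewView from above and exposes a sampleBufferDelegate property as the handover point; both names are placeholders rather than the original code:

```swift
import UIKit
import AVFoundation

// A sketch of the camera view controller; the exact implementation may differ.
final class CameraViewController: UIViewController {

    private let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()

    // Set from SwiftUI; receives the sample frames for pose estimation.
    weak var sampleBufferDelegate: AVCaptureVideoDataOutputSampleBufferDelegate?

    override func loadView() {
        view = CameraPreviewView()
    }

    override func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        setupSession()
    }

    private func setupSession() {
        session.sessionPreset = .hd1920x1080

        // Front camera as input.
        guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: .front),
              let input = try? AVCaptureDeviceInput(device: camera),
              session.canAddInput(input) else { return }
        session.addInput(input)

        // Grab sample frames from the video output and hand them to the delegate.
        videoOutput.setSampleBufferDelegate(sampleBufferDelegate,
                                            queue: DispatchQueue(label: "videoFrames"))
        guard session.canAddOutput(videoOutput) else { return }
        session.addOutput(videoOutput)

        // Show the live preview.
        if let previewView = view as? CameraPreviewView {
            previewView.previewLayer.session = session
            previewView.previewLayer.videoGravity = .resizeAspectFill
        }

        session.startRunning()
    }
}
```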
We wrap the CameraViewController in a UIViewControllerRepresentable and assign a variable of type PoseEstimator as our delegate for receiving sample frames.
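A sketch of that wrapper, reusing the sampleBufferDelegate placeholder from the controller sketch above:

```swift
import SwiftUI

// A sketch of the SwiftUI wrapper around the camera view controller.
struct CameraViewWrapper: UIViewControllerRepresentable {

    var poseEstimator: PoseEstimator

    func makeUIViewController(context: Context) -> CameraViewController {
        let controller = CameraViewController()
        // The pose estimator becomes the delegate that receives the sample frames.
        controller.sampleBufferDelegate = poseEstimator
        return controller
    }

    func updateUIViewController(_ uiViewController: CameraViewController, context: Context) {}
}
```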
Now let’s look at the details of PoseEstimator. It implements captureOutput to conform to AVCaptureVideoDataOutputSampleBufferDelegate. The code is pretty straightforward. As we’ll be receiving a series of images, we create a VNSequenceRequestHandler and make it perform a VNDetectHumanBodyPoseRequest on each image.
The result is received by the detectedBodyPose function, which grabs the first observation and assigns it to the bodyParts variable. This dictionary is of type [VNHumanBodyPoseObservation.JointName : VNRecognizedPoint] and contains the confidence score and position for each of the 19 predefined body points.
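Putting these pieces together, a sketch of the class could look like this (the orientation value and the error handling are assumptions on my part):

```swift
import AVFoundation
import Vision
import Combine

// A sketch of the PoseEstimator; it publishes the detected body points for SwiftUI.
class PoseEstimator: NSObject, ObservableObject, AVCaptureVideoDataOutputSampleBufferDelegate {

    // Publisher that constantly sends new predictions for the recognized body points.
    @Published var bodyParts = [VNHumanBodyPoseObservation.JointName: VNRecognizedPoint]()

    private let sequenceHandler = VNSequenceRequestHandler()

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let request = VNDetectHumanBodyPoseRequest(completionHandler: detectedBodyPose)
        do {
            // Perform the body pose request on every incoming sample frame.
            // The orientation may need adjusting for your camera setup.
            try sequenceHandler.perform([request], on: sampleBuffer, orientation: .right)
        } catch {
            print(error.localizedDescription)
        }
    }

    func detectedBodyPose(request: VNRequest, error: Error?) {
        guard let observation = (request.results as? [VNHumanBodyPoseObservation])?.first,
              let points = try? observation.recognizedPoints(.all) else { return }
        // Publish on the main thread so SwiftUI can react to the new prediction.
        DispatchQueue.main.async {
            self.bodyParts = points
        }
    }
}
```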
Nice. In just a few lines of code, we are now performing real-time inference with front camera images and have a publisher that constantly sends new predictions for the recognized body points.

With the ML portion of our app implemented, we’ll now use the recognized body parts to build two features:
- Draw a stick figure from the body points and lay it over the camera view.
- Count squats and check body posture.
For the stick figure, we first need a way to draw lines between the body points. This can be done with SwiftUI shapes. Remember that so far, we are using normalized points that were returned from Vision. When drawing the shape, we need to scale and translate those points to match the camera view. The size variable is the frame size of the camera view and will be handed down from the top-level view.
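A sketch of such a shape could look like this. The name Stick is a placeholder; the y-axis flip accounts for Vision’s lower-left origin versus SwiftUI’s upper-left origin, and the transform may need adjusting for camera orientation and mirroring:

```swift
import SwiftUI

// A sketch of a Shape that connects a list of normalized body points with straight lines.
struct Stick: Shape {
    var points: [CGPoint]   // normalized points returned by Vision
    var size: CGSize        // frame size of the camera view

    func path(in rect: CGRect) -> Path {
        var path = Path()
        guard let first = points.first else { return path }

        // Scale the normalized point to the view size and flip the y axis.
        func scaled(_ point: CGPoint) -> CGPoint {
            CGPoint(x: point.x * size.width, y: (1 - point.y) * size.height)
        }

        path.move(to: scaled(first))
        for point in points.dropFirst() {
            path.addLine(to: scaled(point))
        }
        return path
    }
}
```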
In a new view, we draw sticks for all body parts. Initially, the bodyParts variable is an empty dictionary, so we need to check whether it has been filled with inference results. At this point, we could also check the confidence scores to avoid drawing inaccurate lines, or customize the size and color of the lines. But we will keep it simple and draw every line in green. Below is an example for the right leg:
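(A sketch, reusing the Stick shape and PoseEstimator sketched above; the force-unwraps are only safe because we checked that bodyParts is not empty.)

```swift
import SwiftUI
import Vision

// A sketch of the stick figure overlay; only the right leg is shown here.
struct StickFigure: View {
    @ObservedObject var poseEstimator: PoseEstimator
    var size: CGSize

    var body: some View {
        if !poseEstimator.bodyParts.isEmpty {
            ZStack {
                // Right leg: ankle -> knee -> hip -> root
                Stick(points: [poseEstimator.bodyParts[.rightAnkle]!.location,
                               poseEstimator.bodyParts[.rightKnee]!.location,
                               poseEstimator.bodyParts[.rightHip]!.location,
                               poseEstimator.bodyParts[.root]!.location],
                      size: size)
                    .stroke(lineWidth: 5.0)
                    .fill(Color.green)
                // ...add the left leg, arms, and torso in the same way
            }
        }
    }
}
```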
Now let’s add both the CameraViewWrapper and the StickFigure to ContentView, as sketched below. We give the ZStack the frame ratio of the video output (1920x1080) to keep the correct aspect ratio. Before running the app, we need to add the Privacy - Camera Usage Description key to the app’s Info.plist. Then place the phone somewhere on the ground, and you will see a green stick figure on top of your full-body selfie.
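Here is a sketch of that ContentView (the GeometryReader and the exact frame calculation are assumptions):

```swift
import SwiftUI

// A sketch of the top-level view combining the camera preview and the stick figure.
struct ContentView: View {
    @StateObject var poseEstimator = PoseEstimator()

    var body: some View {
        VStack {
            ZStack {
                GeometryReader { geo in
                    CameraViewWrapper(poseEstimator: poseEstimator)
                    StickFigure(poseEstimator: poseEstimator, size: geo.size)
                }
            }
            // Match the 1920x1080 video output so the overlay keeps the correct aspect ratio.
            .frame(width: UIScreen.main.bounds.width,
                   height: UIScreen.main.bounds.width * 1920 / 1080,
                   alignment: .center)
        }
    }
}
```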

It’s time to add the logic for counting the repetitions and checking our posture. In order to count the squats, we need to know when we are in the lower or upper position. By looking at the clip above, we can see that the following values change during the squat movement:
- Height of upper body compared to lower legs
- Angle at hip joints
- Angle at knee joints
Luckily, we can both compare the heights of different CGPoints and use trigonometry to calculate angles. In our example, we compare the height of the hips to the height of the knees. If the hips are below the knees, we assume that we are in the lower squat position. For the upper squat position, we cannot compare heights, so we use the knee angle instead. Let’s say that a knee angle of more than 150 degrees indicates that the legs are extended, which suggests we are in the upper position. Add the following function to the PoseEstimator class:
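(A sketch of that logic; squatCount and wasInBottomPosition are assumed property names, and the stored properties have to live in the class body rather than an extension.)

```swift
// Additions inside the PoseEstimator class:
@Published var squatCount = 0
var wasInBottomPosition = false

func countSquats(bodyParts: [VNHumanBodyPoseObservation.JointName: VNRecognizedPoint]) {
    guard let hip = bodyParts[.rightHip]?.location,
          let knee = bodyParts[.rightKnee]?.location,
          let ankle = bodyParts[.rightAnkle]?.location else { return }

    // Knee angle via trigonometry: the angle between the thigh and the lower-leg vectors.
    let thigh = (dx: Double(hip.x - knee.x), dy: Double(hip.y - knee.y))
    let lowerLeg = (dx: Double(ankle.x - knee.x), dy: Double(ankle.y - knee.y))
    var angle = abs(atan2(thigh.dy, thigh.dx) - atan2(lowerLeg.dy, lowerLeg.dx))
    if angle > .pi { angle = 2 * .pi - angle }
    let kneeAngleDegrees = angle * 180 / .pi

    // Lower position: hips below the knees. Vision's normalized y axis points up,
    // so a smaller y value means a lower position in the frame.
    if hip.y < knee.y {
        wasInBottomPosition = true
    }

    // Upper position: legs extended (knee angle above 150 degrees) after a bottom position.
    if kneeAngleDegrees > 150 && wasInBottomPosition {
        squatCount += 1
        wasInBottomPosition = false
    }
}
```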
Finally, let’s have the app check our posture during the squat. I’m not a fitness expert, but I was once told to make sure that my knees stay wider than my ankles. That’s actually very easy to check with the coordinates of the given body parts. We could compare the x coordinates of the points, similar to what we did for the heights of the hips and knees. But the phone might be tilted, so we will use a third technique that considers both the x and y coordinates. In an extension to CGPoint, we define a function that uses the Pythagorean theorem to calculate the distance between two points.
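A minimal sketch of that helper:

```swift
import CoreGraphics

// Euclidean distance between two points via the Pythagorean theorem.
extension CGPoint {
    func distance(to point: CGPoint) -> CGFloat {
        let dx = point.x - x
        let dy = point.y - y
        return (dx * dx + dy * dy).squareRoot()
    }
}
```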
We add a new variable, isGoodPosture, and compare the distances between the knee and ankle joints.
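One way to sketch this is as a separate helper (in the original code it may well be folded into the counting function; the names are placeholders):

```swift
// Additions inside the PoseEstimator class:
@Published var isGoodPosture = true

func checkPosture(bodyParts: [VNHumanBodyPoseObservation.JointName: VNRecognizedPoint]) {
    guard let leftKnee = bodyParts[.leftKnee]?.location,
          let rightKnee = bodyParts[.rightKnee]?.location,
          let leftAnkle = bodyParts[.leftAnkle]?.location,
          let rightAnkle = bodyParts[.rightAnkle]?.location else { return }

    // The knees should be at least as far apart as the ankles.
    let kneeDistance = leftKnee.distance(to: rightKnee)
    let ankleDistance = leftAnkle.distance(to: rightAnkle)
    isGoodPosture = kneeDistance >= ankleDistance
}
```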
The countSquats function now needs to be called every time we get a new prediction of bodyParts. We use Combine for this because it allows us to drop the initial value of the publisher and avoid problems during the force-unwrapping of the locations. Add the following to the PoseEstimator class:
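(A sketch of that wiring; the subscriptions property name is a placeholder, and the posture check sketched above is called in the same place.)

```swift
// Additions inside the PoseEstimator class:
var subscriptions = Set<AnyCancellable>()

override init() {
    super.init()
    // Drop the initial empty dictionary so the downstream force-unwraps are safe,
    // then evaluate every new prediction.
    $bodyParts
        .dropFirst()
        .sink { [weak self] bodyParts in
            self?.countSquats(bodyParts: bodyParts)
            self?.checkPosture(bodyParts: bodyParts)
        }
        .store(in: &subscriptions)
}
```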
We’re almost done. The last step is to add an HStack right under the camera view. A red alert triangle will pop up whenever the posture is not good.
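A sketch of that bar, reading the published squatCount and isGoodPosture values (the view name WorkoutStatusView and the layout are assumptions; it would sit in ContentView’s VStack, right below the camera ZStack):

```swift
import SwiftUI

// A sketch of the status bar shown under the camera view.
struct WorkoutStatusView: View {
    @ObservedObject var poseEstimator: PoseEstimator

    var body: some View {
        HStack {
            Text("Squats: \(poseEstimator.squatCount)")
                .font(.title)
            Spacer()
            if !poseEstimator.isGoodPosture {
                // Red alert triangle whenever the posture check fails.
                Image(systemName: "exclamationmark.triangle.fill")
                    .foregroundColor(.red)
                    .font(.title)
            }
        }
        .padding()
    }
}
```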
Go ahead and run the app. We now have our little AI fitness coach implemented. This is definitely far from being a production-ready app, but I hope you get an idea of how to analyze body movements in SwiftUI. A 3D pose estimation model would provide even more potential for analysis but would probably not run in real-time.
Let me know which other use cases you can come up with. Thanks for reading.
The full code is available here: