Build a Touchless Swipe Tinder-Style App with iOS 14 Vision Hand Pose Estimation

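The Vision framework was introduced in 2017 to let mobile app developers make use of computer vision algorithms with little effort. Specifically, it ships with a number of pre-trained deep learning models, and it also acts as a wrapper for quickly running your own custom Core ML models.

After introducing Text Recognition and VisionKit in iOS 13 to boost OCR, Apple has now shifted its focus to motion and action classification in the iOS 14 Vision framework.

In earlier articles, we saw that the Vision framework can perform Contour Detection and Optical Flow Requests, and that it offers a set of tools for offline video processing. More importantly, we can now do Hand and Body Pose Estimation, which opens up new possibilities for augmented reality and computer vision.

In this article, we'll use hand pose estimation to build an iOS app that senses hand gestures without any touch input.

I previously published an article showing how to build a touchless swipe iOS app with ML Kit's Face Detection API. I found that prototype handy enough to integrate into dating apps such as Tinder or Bumble. However, the constant blinking and head turning it requires can cause eye strain or headaches.

So we'll simply extend that example to swipe left or right with hand gestures instead of touch. After all, these days it makes sense to use our phones to live a little lazier, or to practice social distancing. Before we dive in, let's look at how to create a Vision hand pose request in iOS 14.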

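Vision Hand Pose Estimation

The new VNDetectHumanHandPoseRequest is an image-based Vision request that detects human hand poses. It returns 21 landmark points for each hand, in instances of type VNHumanHandPoseObservation. We can set maximumHandCount to control the maximum number of hands detected in each frame during Vision processing.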
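As a minimal sketch, the request might be configured like this. Limiting detection to a single hand is an assumption made here, on the grounds that a swipe gesture only needs one hand; it is not a requirement of the API.

```swift
import Vision

// Create the hand pose request once and reuse it for every frame.
let handPoseRequest = VNDetectHumanHandPoseRequest()

// Assumption for this sketch: one hand is enough to read a swipe gesture.
handPoseRequest.maximumHandCount = 1
```

Detecting fewer hands also means less work per frame, which may help when the request runs on a live camera feed.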

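We can get the landmark points for each finger by passing the matching enum case to the observation. Each call returns a dictionary keyed by joint name:

let thumbPoints = try observation.recognizedPoints(.thumb)
let indexFingerPoints = try observation.recognizedPoints(.indexFinger)
let middleFingerPoints = try observation.recognizedPoints(.middleFinger)
let ringFingerPoints = try observation.recognizedPoints(.ringFinger)
let littleFingerPoints = try observation.recognizedPoints(.littleFinger)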

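There is also a wrist landmark, located at the center of the wrist. It doesn't belong to any of the groups above; instead, it is part of the all group. You can get it like this: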

let wristPoints = try observation.recognizedPoints(.all)

Once we have the landmark points above, we can extract the individual points like this:

guard let thumbTipPoint = thumbPoints[.thumbTip],
      let indexTipPoint = indexFingerPoints[.indexTip],
      let middleTipPoint = middleFingerPoints[.middleTip],
      let ringTipPoint = ringFingerPoints[.ringTip],
      let littleTipPoint = littleFingerPoints[.littleTip],
      let wristPoint = wristPoints[.wrist] else { return }

thumbIP, thumbMP, and thumbCMC are the other landmark points available in the thumb group. The other fingers expose their own joints in the same way.

[Image: hand landmarks]

Each individual landmark point object carries its location, expressed in Vision's normalized coordinate space, along with a confidence value.
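To illustrate how the confidence value is typically used, here is a small sketch that rejects a set of landmarks when any of them is too uncertain. The Landmark struct and the allReliable function are hypothetical stand-ins so the example runs without a camera, and 0.3 simply mirrors the threshold used later in this article.

```swift
import Foundation

// Hypothetical stand-in for a recognized point: a normalized location plus a confidence score.
struct Landmark {
    let location: CGPoint
    let confidence: Float
}

// Returns true only when every landmark clears the minimum confidence.
func allReliable(_ landmarks: [Landmark], minimumConfidence: Float = 0.3) -> Bool {
    landmarks.allSatisfy { $0.confidence > minimumConfidence }
}

let thumbTip = Landmark(location: CGPoint(x: 0.42, y: 0.71), confidence: 0.92)
let wrist = Landmark(location: CGPoint(x: 0.47, y: 0.35), confidence: 0.18)
print(allReliable([thumbTip, wrist]))  // false: the wrist estimate is too uncertain
```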

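Next, we can use the distance or angle between points to build gesture processors. For instance, Apple's sample app uses the distance between the thumb tip and the index fingertip to create a pinch gesture.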
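A distance-based check of that kind could be sketched as follows. This is not Apple's sample code: the function names are made up for illustration, the points are assumed to be in screen coordinates already, and the 40-point threshold is an arbitrary value you would need to tune.

```swift
import Foundation

// Straight-line distance between two points.
func distance(from a: CGPoint, to b: CGPoint) -> CGFloat {
    let dx = a.x - b.x
    let dy = a.y - b.y
    return (dx * dx + dy * dy).squareRoot()
}

// Treats the hand as pinching when the thumb tip and index tip are closer than `threshold`.
// The default of 40 points is an illustrative assumption, not a tuned value.
func isPinching(thumbTip: CGPoint, indexTip: CGPoint, threshold: CGFloat = 40) -> Bool {
    distance(from: thumbTip, to: indexTip) < threshold
}

print(isPinching(thumbTip: CGPoint(x: 100, y: 100), indexTip: CGPoint(x: 110, y: 120)))  // true
print(isPinching(thumbTip: CGPoint(x: 100, y: 100), indexTip: CGPoint(x: 200, y: 220)))  // false
```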

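Getting Started

Now that we understand the basics of the Vision hand pose request, let's dive into the implementation!

Open Xcode and create a new UIKit app. Make sure the deployment target is set to iOS 14 and that you have set the NSCameraUsageDescription string in Info.plist.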

[Image: Xcode settings]

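I covered how to build Tinder-style cards with animations in a previous article, so you can refer to the final code from that piece.

Likewise, you can refer to the code for the StackContainerView.swift class here. This class holds the stack of Tinder cards.

Setting Up the Camera with AVFoundation

Next, let's build a custom camera using Apple's AVFoundation framework.

Here's the code for the ViewController.swift file: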

import UIKit
import AVFoundation
import Vision

class ViewController: UIViewController, HandSwiperDelegate {
    //MARK: - Properties
    var modelData = [DataModel(bgColor: .systemYellow),
                         DataModel(bgColor: .systemBlue),
                         DataModel(bgColor: .systemRed),
                         DataModel(bgColor: .systemTeal),
                         DataModel(bgColor: .systemOrange),
                         DataModel(bgColor: .brown)]
    var stackContainer : StackContainerView!
    var buttonStackView: UIStackView!
    var leftButton : UIButton!, rightButton : UIButton!
    var cameraView : CameraView!
    //MARK: - Init
    override func loadView() {
        view = UIView()
        stackContainer = StackContainerView()
        view.addSubview(stackContainer)
        configureStackContainer()
        stackContainer.translatesAutoresizingMaskIntoConstraints = false
        addButtons()
        configureNavigationBarButtonItem()
        addCameraView()
    }
    override func viewDidLoad() {
        super.viewDidLoad()
        title = "HandPoseSwipe"
        stackContainer.dataSource = self
    }
    private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedDataOutput", qos: .userInteractive)
    private var cameraFeedSession: AVCaptureSession?
    private var handPoseRequest = VNDetectHumanHandPoseRequest()
    let message = UILabel()
    var handDelegate : HandSwiperDelegate?
    // True while the hand is in a neutral pose; processPoints uses it to fire each gesture only once.
    var restingHand = true
    func addCameraView()
    {
        cameraView = CameraView()
        self.handDelegate = self
        view.addSubview(cameraView)
        cameraView.translatesAutoresizingMaskIntoConstraints = false
        cameraView.bottomAnchor.constraint(equalTo: view.bottomAnchor).isActive = true
        cameraView.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        cameraView.widthAnchor.constraint(equalToConstant: 150).isActive = true
        cameraView.heightAnchor.constraint(equalToConstant: 150).isActive = true
    }
    //MARK: - Configurations
    func configureStackContainer() {
        stackContainer.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        stackContainer.centerYAnchor.constraint(equalTo: view.centerYAnchor, constant: -60).isActive = true
        stackContainer.widthAnchor.constraint(equalToConstant: 300).isActive = true
        stackContainer.heightAnchor.constraint(equalToConstant: 400).isActive = true
    }
    func addButtons()
    {
        //full source of UI setup at the end of this article
    }
    @objc func onButtonPress(sender: UIButton){
        UIView.animate(withDuration: 2.0,
                                   delay: 0,
                                   usingSpringWithDamping: CGFloat(0.20),
                                   initialSpringVelocity: CGFloat(6.0),
                                   options: UIView.AnimationOptions.allowUserInteraction,
                                   animations: {
                                    sender.transform = CGAffineTransform.identity
                                   },
                                   completion: nil)
        if let firstView = stackContainer.subviews.last as? TinderCardView{
            if sender.tag == 0{
                firstView.leftSwipeClicked(stackContainerView: stackContainer)
            }
            else{
                firstView.rightSwipeClicked(stackContainerView: stackContainer)
            }
        }
    }
    func configureNavigationBarButtonItem() {
        navigationItem.rightBarButtonItem = UIBarButtonItem(title: "Reset", style: .plain, target: self, action: #selector(resetTapped))
    }
    @objc func resetTapped() {
        stackContainer.reloadData()
    }
    override func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        do {
            if cameraFeedSession == nil {
                cameraView.previewLayer.videoGravity = .resizeAspectFill
                try setupAVSession()
                cameraView.previewLayer.session = cameraFeedSession
            }
            cameraFeedSession?.startRunning()
        } catch {
            AppError.display(error, inViewController: self)
        }
    }
    override func viewWillDisappear(_ animated: Bool) {
        cameraFeedSession?.stopRunning()
        super.viewWillDisappear(animated)
    }
    func setupAVSession() throws {
        // Select a front facing camera, make an input.
        guard let videoDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front) else {
            throw AppError.captureSessionSetup(reason: "Could not find a front facing camera.")
        }
        guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
            throw AppError.captureSessionSetup(reason: "Could not create video device input.")
        }
        let session = AVCaptureSession()
        session.beginConfiguration()
        session.sessionPreset = AVCaptureSession.Preset.high
        // Add a video input.
        guard session.canAddInput(deviceInput) else {
            throw AppError.captureSessionSetup(reason: "Could not add video device input to the session")
        }
        session.addInput(deviceInput)
        let dataOutput = AVCaptureVideoDataOutput()
        if session.canAddOutput(dataOutput) {
            session.addOutput(dataOutput)
            // Add a video data output.
            dataOutput.alwaysDiscardsLateVideoFrames = true
            dataOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
            dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
        } else {
            throw AppError.captureSessionSetup(reason: "Could not add video data output to the session")
        }
        session.commitConfiguration()
        cameraFeedSession = session
    }
}

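The code above involves quite a few steps. Let's go through them one by one:

CameraView is a custom UIView class that displays the camera feed on screen. We'll look at this class in more detail shortly.
In setupAVSession(), we select the front-facing camera and add it as an input to the AVCaptureSession.
We then call setSampleBufferDelegate on the AVCaptureVideoDataOutput.

The ViewController class also conforms to the HandSwiperDelegate protocol: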

protocol HandSwiperDelegate {
  func thumbsDown()
  func thumbsUp()
}

When a hand gesture is detected, we trigger the corresponding method. Now, let's look at how to run the Vision request on the captured frames.
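The two callbacks can be mapped to swipes roughly as in the sketch below. CardDeck and SwipeDirection are hypothetical stand-ins that keep the example self-contained; in the actual app, ViewController would more likely call leftSwipeClicked(stackContainerView:) or rightSwipeClicked(stackContainerView:) on the top TinderCardView, just as onButtonPress does. Mapping thumbs down to a left swipe is also an assumption.

```swift
// Same protocol as in the article.
protocol HandSwiperDelegate {
    func thumbsDown()
    func thumbsUp()
}

enum SwipeDirection { case left, right }

// Hypothetical stand-in for the card stack, used only to keep this sketch runnable.
final class CardDeck: HandSwiperDelegate {
    private(set) var cards: [String]
    private(set) var history: [SwipeDirection] = []

    init(cards: [String]) { self.cards = cards }

    func thumbsDown() { swipeTopCard(.left) }   // assumed mapping: thumbs down rejects
    func thumbsUp() { swipeTopCard(.right) }    // assumed mapping: thumbs up accepts

    private func swipeTopCard(_ direction: SwipeDirection) {
        guard !cards.isEmpty else { return }    // nothing left to swipe
        cards.removeLast()                      // the last element is the top card
        history.append(direction)
    }
}

let deck = CardDeck(cards: ["yellow", "blue", "red"])
deck.thumbsUp()
deck.thumbsDown()
print(deck.cards)  // ["yellow"]
```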

Running the Vision Hand Pose Request on Captured Frames

In the following code, we create an extension of the ViewController above that conforms to the AVCaptureVideoDataOutputSampleBufferDelegate protocol:

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        var thumbTip: CGPoint?
        var wrist: CGPoint?
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])
        do {
            // Perform VNDetectHumanHandPoseRequest
            try handler.perform([handPoseRequest])
            guard let observation = handPoseRequest.results?.first else {
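                // This callback runs on the video queue, so hop to the main queue for UI work.
                DispatchQueue.main.async { self.cameraView.showPoints([]) }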
                return
            }
            // Get points for all fingers
            let thumbPoints = try observation.recognizedPoints(.thumb)
            let wristPoints = try observation.recognizedPoints(.all)
            let indexFingerPoints = try observation.recognizedPoints(.indexFinger)
            let middleFingerPoints = try observation.recognizedPoints(.middleFinger)
            let ringFingerPoints = try observation.recognizedPoints(.ringFinger)
            let littleFingerPoints = try observation.recognizedPoints(.littleFinger)
            // Extract individual points from Point groups.
            guard let thumbTipPoint = thumbPoints[.thumbTip],
                  let indexTipPoint = indexFingerPoints[.indexTip],
                  let middleTipPoint = middleFingerPoints[.middleTip],
                  let ringTipPoint = ringFingerPoints[.ringTip],
                  let littleTipPoint = littleFingerPoints[.littleTip],
                  let wristPoint = wristPoints[.wrist]
            else {
                // UI work belongs on the main queue.
                DispatchQueue.main.async { self.cameraView.showPoints([]) }
                return
            }
            let confidenceThreshold: Float = 0.3
            guard   thumbTipPoint.confidence > confidenceThreshold &&
                    indexTipPoint.confidence > confidenceThreshold &&
                    middleTipPoint.confidence > confidenceThreshold &&
                    ringTipPoint.confidence > confidenceThreshold &&
                    littleTipPoint.confidence > confidenceThreshold &&
                    wristPoint.confidence > confidenceThreshold
            else {
                // UI work belongs on the main queue.
                DispatchQueue.main.async { self.cameraView.showPoints([]) }
                return
            }
            // Convert points from Vision coordinates to AVFoundation coordinates.
            thumbTip = CGPoint(x: thumbTipPoint.location.x, y: 1 - thumbTipPoint.location.y)
            wrist = CGPoint(x: wristPoint.location.x, y: 1 - wristPoint.location.y)
            DispatchQueue.main.async {
                self.processPoints([thumbTip, wrist])
            }
        } catch {
            cameraFeedSession?.stopRunning()
            let error = AppError.visionError(error: error)
            DispatchQueue.main.async {
                error.displayInViewController(self)
            }
        }
    }
}

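Note that the landmark points returned by the observation are in the Vision coordinate system. We have to convert them to UIKit coordinates before we can draw them on screen.

As a first step, we convert them to AVFoundation coordinates like this:

wrist = CGPoint(x: wristPoint.location.x, y: 1 - wristPoint.location.y)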
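The flip works because Vision reports normalized points with the origin at the lower left, while the capture device space puts the origin at the upper left. As a small sketch, the conversion can be wrapped in a helper; the function name captureDevicePoint(fromVisionPoint:) is made up for this example.

```swift
import Foundation

// Vision reports normalized points with the origin at the lower left.
// Flipping y yields a normalized point with the origin at the upper left.
func captureDevicePoint(fromVisionPoint point: CGPoint) -> CGPoint {
    CGPoint(x: point.x, y: 1 - point.y)
}

let converted = captureDevicePoint(fromVisionPoint: CGPoint(x: 0.25, y: 0.75))
print(converted.x, converted.y)  // 0.25 0.25
```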

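Next, we pass these landmark points to the processPoints function. To keep things simple, we only use two landmarks, the thumb tip and the wrist, to detect the gesture.

Here's the code for the processPoints function: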

func processPoints(_ points: [CGPoint?]) {
    let previewLayer = cameraView.previewLayer
    // Convert each point from AVFoundation coordinates to UIKit coordinates.
    var pointsConverted: [CGPoint] = []
    for point in points {
        pointsConverted.append(previewLayer.layerPointConverted(fromCaptureDevicePoint: point!))
    }
    let thumbTip = pointsConverted[0]
    let wrist = pointsConverted[pointsConverted.count - 1]
    // In UIKit coordinates y grows downward, so a positive value means the thumb is below the wrist.
    let yDistance = thumbTip.y - wrist.y
    if yDistance > 50 {
        // Fire only once per gesture, until the hand returns to a neutral pose.
        if self.restingHand {
            self.restingHand = false
            self.handDelegate?.thumbsDown()
        }
    } else if yDistance < -50 {
        if self.restingHand {
            self.restingHand = false
            self.handDelegate?.thumbsUp()
        }
    } else {
        self.restingHand = true
    }
    cameraView.showPoints(pointsConverted)
}

The following line converts a point from AVFoundation coordinates to UIKit coordinates:

previewLayer.layerPointConverted(fromCaptureDevicePoint: point!)

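Finally, we compare the vertical distance between the two landmark points against a fixed threshold, and trigger a left or right swipe on the card stack accordingly.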
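The restingHand flag in processPoints acts as a simple debounce: a gesture fires once, and nothing fires again until the hand passes through a neutral pose. The sketch below isolates that logic in a standalone type so it can be tried without a camera. ThumbGesture and ThumbGestureDetector are hypothetical names; the 50-point threshold is taken from processPoints.

```swift
import Foundation

enum ThumbGesture { case thumbsUp, thumbsDown }

// Debounced detector: reports a gesture once, then waits for the hand to
// return to a neutral pose before reporting again.
struct ThumbGestureDetector {
    var threshold: CGFloat = 50      // same 50-point threshold as processPoints
    private var restingHand = true

    // y values are in UIKit coordinates, where y grows downward.
    mutating func update(thumbTipY: CGFloat, wristY: CGFloat) -> ThumbGesture? {
        let yDistance = thumbTipY - wristY
        if abs(yDistance) <= threshold {
            restingHand = true       // a neutral pose re-arms the detector
            return nil
        }
        guard restingHand else { return nil }  // still holding the previous gesture
        restingHand = false
        return yDistance > 0 ? .thumbsDown : .thumbsUp
    }
}

var detector = ThumbGestureDetector()
let first = detector.update(thumbTipY: 40, wristY: 120)    // thumb 80 points above the wrist
let held = detector.update(thumbTipY: 35, wristY: 120)     // same gesture still held
let neutral = detector.update(thumbTipY: 110, wristY: 120) // back to neutral
let second = detector.update(thumbTipY: 200, wristY: 120)  // thumb 80 points below the wrist
print(first == .thumbsUp, held == nil, neutral == nil, second == .thumbsDown)  // true true true true
```

Without the flag, every frame in which the thumb stays raised would trigger another swipe, and a whole deck could disappear in under a second.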

With cameraView.showPoints(pointsConverted), we draw a line connecting the two landmark points on a sublayer of CameraView.

Here's the full code for the CameraView class:

import UIKit
import AVFoundation
class CameraView: UIView {
    private var overlayThumbLayer = CAShapeLayer()
    var previewLayer: AVCaptureVideoPreviewLayer {
        return layer as! AVCaptureVideoPreviewLayer
    }
    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
    override init(frame: CGRect) {
        super.init(frame: frame)
        setupOverlay()
    }
    required init?(coder: NSCoder) {
        super.init(coder: coder)
        setupOverlay()
    }
    override func layoutSublayers(of layer: CALayer) {
        super.layoutSublayers(of: layer)
        if layer == previewLayer {
            overlayThumbLayer.frame = layer.bounds
        }
    }
    private func setupOverlay() {
        previewLayer.addSublayer(overlayThumbLayer)
    }
    func showPoints(_ points: [CGPoint]) {
        // Both the thumb tip and the wrist are needed to draw the line.
        guard points.count >= 2, let wrist = points.last else {
            // Clear all CALayers
            clearLayers()
            return
        }
        let thumbColor = UIColor.green
        drawFinger(overlayThumbLayer, Array(points[0...1]), thumbColor, wrist)
    }
    func drawFinger(_ layer: CAShapeLayer, _ points: [CGPoint], _ color: UIColor, _ wrist: CGPoint) {
        let fingerPath = UIBezierPath()
        for point in points {
            fingerPath.move(to: point)
            fingerPath.addArc(withCenter: point, radius: 5, startAngle: 0, endAngle: 2 * .pi, clockwise: true)
        }
        fingerPath.move(to: points[0])
        fingerPath.addLine(to: points[points.count - 1])
        layer.fillColor = color.cgColor
        layer.strokeColor = color.cgColor
        layer.lineWidth = 5.0
        layer.lineCap = .round
        CATransaction.begin()
        CATransaction.setDisableActions(true)
        layer.path = fingerPath.cgPath
        CATransaction.commit()
    }
    func clearLayers() {
        let emptyPath = UIBezierPath()
        CATransaction.begin()
        CATransaction.setDisableActions(true)
        overlayThumbLayer.path = emptyPath.cgPath
        CATransaction.commit()
    }
}

The Final Result

Here's what the final app looks like:

[Image: Vision framework hand pose estimation demo]

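Conclusion

Vision's new hand pose estimation request can be put to use in many scenarios, including taking selfies with hand gestures, drawing signatures, or even recognizing the different hand gestures Trump makes during his speeches.

You can also chain the hand pose request with the body pose request to build more complex poses.

The full source code for this project is available in the GitHub repository.

That's it for this article. Thanks for reading!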