Deep Dojo

MLModel API

Generated Code

When you take a small Core ML model like SqueezeNet and drop it into Xcode, you get a SqueezeNet implementation. Without comments, the class itself is only a few dozen lines long.

import CoreML

class SqueezeNet {
    var model: MLModel

    /// Load the model from a compiled .mlmodelc at the given URL.
    init(contentsOf url: URL) throws {
        self.model = try MLModel(contentsOf: url)
    }

    /// Load the compiled model that Xcode bundles with the app.
    convenience init() {
        let bundle = Bundle(for: SqueezeNet.self)
        let assetPath = bundle.url(forResource: "SqueezeNet", withExtension: "mlmodelc")
        try! self.init(contentsOf: assetPath!)
    }

    /// Run a prediction from a typed input and wrap the result in a typed output.
    func prediction(input: SqueezeNetInput) throws -> SqueezeNetOutput {
        let outFeatures = try model.prediction(from: input)
        let result = SqueezeNetOutput(
            classLabelProbs: outFeatures.featureValue(for: "classLabelProbs")!.dictionaryValue as! [String : Double],
            classLabel: outFeatures.featureValue(for: "classLabel")!.stringValue)
        return result
    }

    /// Convenience overload that takes a pixel buffer directly.
    func prediction(image: CVPixelBuffer) throws -> SqueezeNetOutput {
        let input_ = SqueezeNetInput(image: image)
        return try self.prediction(input: input_)
    }
}

SqueezeNet is not a subclass of MLModel. It’s a wrapper with methods for loading and driving a model property. Xcode generates input and output classes for the model as well.
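Those companion classes are just as thin. The sketch below approximates the shape of the generated input class rather than reproducing Xcode's output verbatim: it adopts MLFeatureProvider and hands its pixel buffer back as a named feature value.

import CoreML
import CoreVideo

// Approximate shape of the generated input class (not verbatim Xcode output).
class SqueezeNetInput: MLFeatureProvider {
    var image: CVPixelBuffer

    var featureNames: Set<String> {
        return ["image"]
    }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        if featureName == "image" {
            return MLFeatureValue(pixelBuffer: image)
        }
        return nil
    }

    init(image: CVPixelBuffer) {
        self.image = image
    }
}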

Feature Introspection

In the end, a class like SqueezeNet is a high-level, type-safe convenience. At a lower level, it’s possible to drive the model directly and generically. This is how a framework like Vision is able to inspect and use an arbitrary model at runtime.

MLModel sits at the heart of Core ML. It’s an abstraction that’s focused on input and output features. MLModelDescription indicates how these features are structured. An MLFeatureProvider delivers input into the model. A corresponding MLFeatureProvider contains output.
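Here is a minimal sketch of that generic path, assuming a URL to some compiled .mlmodelc and a pixel buffer to classify. The feature names "image" and "classLabel" mirror the SqueezeNet example above.

import CoreML

// compiledModelURL and pixelBuffer are placeholders for your own model and input.
let model = try MLModel(contentsOf: compiledModelURL)

// MLModelDescription lists the input and output features by name.
for (name, description) in model.modelDescription.inputDescriptionsByName {
    print("input \(name): \(description.type)")
}
for (name, description) in model.modelDescription.outputDescriptionsByName {
    print("output \(name): \(description.type)")
}

// Any MLFeatureProvider can drive the model. MLDictionaryFeatureProvider is a
// ready-made one keyed by feature name.
let input = try MLDictionaryFeatureProvider(
    dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)])
let output = try model.prediction(from: input)
let label = output.featureValue(for: "classLabel")?.stringValue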

Image Processing and Classification

Using the Vision framework, a VNImageBasedRequest turns an image into observations.

VNCoreMLRequest is a type of image-based request that uses a Core ML model. Depending on the output description of the model, the request will automatically produce classifications, pixel buffers, or feature values.
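A sketch of that flow, using the generated SqueezeNet class from above and a cgImage placeholder for whatever image you want to classify:

import Vision

// Wrap the Core ML model for use with Vision.
let visionModel = try VNCoreMLModel(for: SqueezeNet().model)

let request = VNCoreMLRequest(model: visionModel) { request, error in
    // For a classifier, results arrive as VNClassificationObservation values.
    guard let observations = request.results as? [VNClassificationObservation],
          let best = observations.first else { return }
    print("\(best.identifier): \(best.confidence)")
}

// cgImage is a placeholder for the image being classified.
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try handler.perform([request])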

As Matthijs Hollemans mentions in his post, it’s important to match image input to the preprocessing that a model expects. Much of this can be addressed when the model is converted to Core ML. From the modelDescription, the Vision framework is able to convert most images into the pixel format that a model wants.
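For illustration, that expectation can also be read directly. Assuming the same MLModel instance and "image" feature name as in the earlier sketch, the image constraint reports the size and pixel format the model was built for:

if let constraint = model.modelDescription.inputDescriptionsByName["image"]?.imageConstraint {
    // MLImageConstraint describes the width, height, and pixel format the model expects.
    print("expects \(constraint.pixelsWide) x \(constraint.pixelsHigh), format \(constraint.pixelFormatType)")
}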

Another kind of compatibility to be aware of is image orientation. This is particularly true on iOS, where captured images typically align with the camera sensor rather than with the orientation of the device. Request handlers in the Vision framework offer a way to specify orientation when applying Core ML requests to camera images.
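A sketch of that, reusing the request from the Vision example above and treating pixelBuffer and the orientation value as stand-ins for your own capture pipeline:

import Vision
import ImageIO

// The orientation tells Vision how to rotate the buffer before running the model.
let orientation = CGImagePropertyOrientation.right  // e.g. portrait device, back camera
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                    orientation: orientation,
                                    options: [:])
try handler.perform([request])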

Conclusion

“A successful book is not made of what is in it, but of what is left out of it.”

— Mark Twain

When using an MLModel, an app is free to focus on input and output. Model implementation is set aside as an optimization detail to be navigated by the .mlmodel creator, who knows the model, and by Apple engineers, who know their lineup of hardware.

Xcode can generate code for a model that is convenient and easy to use. However, if your solution involves plugging in a range of potential models, any MLModel can be inspected at runtime to query how its input and output features are structured.