In part 1, we saw how today’s Deep Learning tools and data ecosystems make it easy to build an early prototype and assess the feasibility of a common Deep Learning task.
That said, having a workable prototype that shows the potential of the approach is one thing; reaching a level of detection reliable enough to put the feature in the hands of millions of users is another.
As shown in this video, the objective of magicplan-AI is to let the user add a door or a window on a given wall, both at the correct position and with the correct size by simply taking a picture of the object while capturing the room.
This capture needs to be reliable in terms of location and size, and it needs to run in real time so that it does not degrade the quality of the room capture already in production.
By reliable, we mean several things:
To measure the performance of such a system, we use a metric called F1, which “summarises” in a single value most of the above requirements by combining the important factors: true positives, false positives and false negatives.
Note: there are other metrics like the mAP (mean Average Precision) that are more complete, but for the sake of simplicity, I will stick to F1 in this story.
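For readers who want to see how F1 is derived from these three counts, here is a minimal sketch in Python. The function name and the example counts are mine, for illustration only; they are not taken from the magicplan-AI code base or from our results.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw detection counts.

    tp: true positives  (objects correctly detected)
    fp: false positives (detections that match no real object)
    fn: false negatives (real objects that were missed)
    """
    # Precision: among everything we detected, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: among everything we should have detected, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, purely illustrative.
print(precision_recall_f1(tp=80, fp=20, fn=20))
```

With 80 correct detections, 20 false alarms and 20 missed objects, precision and recall are both 0.8, and so is F1. Because F1 is a harmonic mean, it drops quickly as soon as either precision or recall gets low.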
It is important to understand that, depending on the usage scenario for a given object detector, errors of different natures can have very different impacts.
A powerful illustration of this is a cancer cell detector, for which failing to detect a real cancerous cell has catastrophic consequences compared with wrongly flagging a healthy one.
The same asymmetry exists in the magicplan capture scenario:
This asymmetry in the cost of object detection errors needs to be reflected in the metrics we use, by handling each error type specifically.
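One common way to encode such an asymmetry in a single number is the F-beta score, a generalisation of F1. The sketch below is an illustration of that general technique, not necessarily the exact weighting used in magicplan-AI; the function name and the counts are hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score computed from raw detection counts.

    beta > 1 gives more weight to recall (missed objects cost more),
    beta < 1 gives more weight to precision (false alarms cost more),
    beta = 1 is the usual F1 score.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical detector that raises many false alarms but misses few objects.
counts = dict(tp=80, fp=40, fn=10)
print(f_beta(**counts, beta=2.0))   # recall-oriented view
print(f_beta(**counts, beta=0.5))   # precision-oriented view
```

The same detector gets a better score under the recall-oriented setting than under the precision-oriented one, which is exactly the kind of distinction a single F1 value hides.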
Once the metrics are defined, we can rely on a quantitative approach to evaluate the different iterations of our object detector:
Three methods have been used to improve the F1 score:
In the world of Deep Learning, data is key and there are many well-known techniques to improve the training dataset.
The first one is obviously to increase the size of the training dataset by adding more images or by using data augmentation techniques.
But surprisingly, for us, the most significant improvement came from cleaning the annotated ImageNet database.
There are several reasons for that:
1. not all images with doors or windows are relevant to magicplan's interior use cases, which we call a domain shift issue,
Not the type of door we are interested in!
2. “door” and “window” are generic words that you can find in situations that have nothing to do with a house interior, which we call a semantic shift issue,
3. crowd-sourced annotation has its limits and mistakes can slip into the annotations, which we call a mis-annotation issue,
4. while it is ok to annotate only one window in a picture containing several windows for a classification task, it is not ok to forget to annotate visible windows for an object detection task, which we would call a partial-annotation issue,
5. we realized that doors photographed open can be problematic for magicplan-AI, as we are actually looking for the door frame, not the moving part, which we call a task annotation shift issue.
For each type of issue, a fix was identified and applied.
Revisiting the training dataset by removing, re-annotating or re-weighting images improved the F1 score by more than 23%, reaching an interesting 0.85.
Training a Deep Learning model correctly is as much about the right dataset as it is about applying the right correction when an error is found during training.
Fortunately for us, a large body of literature is available on choosing the right loss function. Even better, in the particular case of Object Detection, Facebook's Detectron project identified a key improvement in the way the loss is applied, called Focal Loss, which was very easy for us to implement.
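To make the idea concrete, here is a minimal sketch of the binary focal loss for a single prediction, following the formula published in the paper “Focal Loss for Dense Object Detection” (Lin et al.). It is written in plain Python for readability and is not our training code; the defaults gamma = 2 and alpha = 0.25 are the ones reported in the paper, not necessarily the values used in magicplan-AI.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability that the object is present (0..1)
    y: ground truth label, 1 for object, 0 for background
    gamma: focusing parameter; 0 removes the focusing effect
    alpha: class balancing weight applied to the positive class
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

The point of the modulating factor is that the huge number of easy background regions in an image no longer dominates the training signal, so the model spends its effort on the hard cases.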
As a result, combining the training database quality improvement with the introduction of a better fitted loss function, we were able to significantly improve the F1 score as illustrated below.
Academic research has been quite active in the Object Detection field and several Deep Learning architectures are available for this task.
They can be grouped along two axes:
A — the type of feature extractor that processes the input image:
B — the number of steps to do the full detection:
As expected, the more complex the architecture, the better the performance (see graphic below).
However, we discovered quite early that, even for inference (running the model to detect objects in an image, as opposed to training, which requires far more resources), not all architectures fit the constraints of running on a mobile device.
Two reasons for that:
Some architectures do not fit in memory on the device. Others do, but they take several seconds to process a single object detection, which is not acceptable in the magicplan real-time capture scenario.
Contrary to the first “quick & easy” stage, being able to play with all the options in the “off the shelf” models requires several conditions:
In our case, this could not have happened without the presence on the team of two full-time PhDs in artificial intelligence / deep learning who master these challenges.
At this point we have a best-in-class model that does a good job at object detection but is too big to run on any modern smartphone.
In the last part, we will describe in more detail the work required to move from a PC-based solution to a smartphone-embedded one.