The discussion between Marek Tatara and Karol Duzinkiewicz, our Computer Vision Researcher, concerns the rapid progress of Machine Learning as applied to various computer vision tasks. “It has proven to be very efficient and accurate, sometimes reaching levels of accuracy beyond human capabilities. Nevertheless, classical computer vision algorithms and analytical processing are still heavily relied on and cannot be easily removed from the vision processing pipelines”, says Marek. The interview covers the boundaries of applicability of Machine Learning, future trends, and the importance of classical methods.
A short introduction of our expert
Marek Tatara, Chief Scientific Officer at DAC.digital, is an Assistant Professor at the Gdańsk University of Technology, an AI/ML Expert at M5 Technology, and a Member of the Polish Society For Measurement, Automatic Control And Robotics. He works on the implementation of both EU-funded and commercial R&D projects in the fields of Computer Vision, Machine Learning, and Embedded Systems. Marek is the author and co-author of several scientific publications on topics including signal processing, process modeling, diagnostic systems, and evolutionary music composition.
- Let’s start with the title – where are the limits of Machine Learning for Computer Vision?
For now, the limits are related to both computational aspects and the capabilities of ML algorithms. Some of the latest neural networks require huge amounts of data and resources to train, which is often out of reach for startups or very expensive. On the other hand, although the capabilities of neural networks have taken a leap over the past decade, the tools are not perfect yet, and some classical algorithms can be more efficient than ML-based ones.
- Let me ask from the other perspective – what are the advantages of ML approaches?
The main advantages of ML algorithms to consider are computational efficiency and the ability to generalize. For some models, the computational efficiency may greatly exceed that of classical methods (depending, of course, on the task and available resources). There are also tasks where classical algorithms are less accurate than ML algorithms (like object detection or 3D reconstruction). Another thing to consider is what transfer learning techniques have enabled: even though a network was originally trained to handle one task, it can be retrained to perform other tasks with a smaller overhead than training from scratch. This is because some of the deep features can be reused across different applications, so the training converges faster.
- How about the advantages of classical methods?
The advantages of classical algorithms include their availability (in different implementations and libraries, sometimes parallelized, for different hardware), optimization for a specific task, and predictable outcomes. Another advantage is their explainability, which is embedded in them by design, while explainable AI (XAI) is a field that emerged relatively recently and is still under heavy development.
- How would you describe the expertise needed to work with classical methods as compared to Machine Learning?
ML and CV are distinct fields, but recently they have been strongly overlapping. For CV engineers it is crucial to understand the process of image acquisition, to know the physical and mathematical relationships between the image and the real world, to know how digital filters work, to stay up-to-date with Computer Vision algorithms, and to ‘feel’ the image processing. Being an ML engineer, on the other hand, requires a slightly different skillset, especially in terms of signal processing (as ML engineers may not focus on images only). ML engineers should keep track of recent advances in the field (especially as it is progressing rapidly) and know when not to use ML and to go for more classical pre- or post-processing methods instead. Nevertheless, nowadays the two fields share some of these competencies, which are nice to have in both: CV engineers should be able to deploy a simple neural network, while ML engineers should have some background in image processing.
- Is it beneficial to combine ML and classical methods?
Yes, it is, and sometimes it’s even indispensable. Even though ML models can achieve high accuracy and efficiency for some tasks, bringing models to such a state often requires large volumes of data and many experiments with different architectures. In some cases, a solution can be reached faster by combining an ML model of some (not necessarily high) accuracy with pre-processing or post-processing steps. Pre-processing is applied when classical methods can be used to extract meaningful information from the input data. Post-processing is used when the output of a model needs to be corrected or processed to get more information (machine- or human-readable). Other cases where both methods come together are safety-oriented redundancy (using both methods and checking whether the outputs are consistent), edge case handling (where it may be difficult to collect data to train the model, but the cases can be bounded and detected by classical methods), or embedding physical aspects into the output of the network (e.g. fitting a free-fall model or using 3D math to link pixels with the real world). There’s definitely more than that. To build a well-working processing pipeline, both approaches are usually combined, and with a good blend of the two the outcomes can be spectacular.
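The free-fall example mentioned above can be sketched as a small post-processing step. This is only an illustration under assumptions that are not part of the interview: the per-frame detections have already been converted to heights in metres (measured upwards), gravity is known, and the names `fit_free_fall` and `predict_height` are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed known)


def fit_free_fall(times, heights, g=G):
    """Least-squares fit of h(t) = h0 + v0*t - 0.5*g*t^2 to noisy heights.

    Returns the estimated initial height h0 and initial vertical velocity v0.
    """
    if len(times) != len(heights) or len(times) < 2:
        raise ValueError("need at least two (time, height) pairs")
    # Move the known gravity term to the left-hand side, which leaves a
    # straight line in t:  heights + 0.5*g*t^2 = h0 + v0*t
    line = [h + 0.5 * g * t * t for t, h in zip(times, heights)]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(line) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_l) for t, v in zip(times, line))
    v0 = cov / var_t            # slope of the line
    h0 = mean_l - v0 * mean_t   # intercept of the line
    return h0, v0


def predict_height(h0, v0, t, g=G):
    """Height predicted by the fitted free-fall model at time t."""
    return h0 + v0 * t - 0.5 * g * t * t
```

Because the physics is fixed and only two parameters are estimated, the fitted curve can smooth noisy detections or flag a frame whose detection lies far from the predicted height.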
- What are the most common areas where CV is used?
It is difficult to point to a single industry, as CV is a cross-sectoral field with knowledge transferable between sectors. Nevertheless, numerous applications can be seen in automotive, to monitor a car’s surroundings and nearby objects (where CV is supported by data from other sensors); in MedTech, where CV can be used to detect diseases or anomalies in medical imagery or to assist surgeons during operations; in entertainment, where it can be used to measure user engagement or to develop vision-based interfaces for controlling virtual environments (in Virtual or Mixed Reality); in robotics, to open up the potential for adaptive robotics or reconfigurable manufacturing; and in forestry and agriculture, to analyze crops, detect diseases, or plan harvests.
- We’re talking about Computer Vision – but are there any ways to boost the efficacy of CV with, for instance, other types of data?
Incorporating other data sources could be a real booster for some CV applications. We could measure some points of reference in the world to have an anchor for further computations, incorporate IMU data to have precise information about movement, include geolocation data that can be used for further image or mesh stitching, fetch data from a medical database or a patient’s history to get more accurate results, or combine CV with LIDAR data to obtain 3D models with visual aspects overlaid on top of them. In terms of implementation, the processing speed could be increased by using hardware accelerators, going for parallel processing, or, in the case of neural networks, reducing their size (by pruning or knowledge distillation).
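To make the idea of pruning concrete, here is a minimal sketch of magnitude-based pruning on a flat list of weights. It is an assumption-laden toy, not a method described in the interview: real frameworks prune tensors layer by layer and usually fine-tune afterwards, and the function name `magnitude_prune` is hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights.

    weights:  flat list of floats.
    sparsity: fraction of weights to remove, between 0 and 1.
    Returns a new list; the input is left unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]
```

The zeroed weights only save time or memory if the runtime or hardware can exploit the resulting sparsity, which is why pruning is usually discussed together with accelerators.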
- Let’s talk about privacy – how can we assure the privacy of data? How to be compliant with GDPR?
Nowadays, privacy is a key concern, and everyone has the right to privacy. One of the ways to be compliant with GDPR regulations is to anonymize the data, which addresses the right to privacy, but on the other hand, some of the data may not be usable afterward. Another option is to move the processing to the edge, where all the sensitive data are handled in the direct proximity of the cameras and are not stored or sent anywhere. The data could also be processed and transformed into a representation that makes it impossible to retrieve sensitive information but still conveys the relevant information from the original data.
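As a simple illustration of anonymizing image data, the sketch below pixelates a rectangular region (for example, a detected face) in a grayscale image stored as a list of rows. The function name `pixelate_region`, the image representation, and the block size are assumptions made for this example; whether pixelation is a sufficient anonymization measure depends on the use case.

```python
def pixelate_region(image, top, left, height, width, block=4):
    """Return a copy of a grayscale image with one rectangle pixelated.

    image: list of rows, each a list of ints in 0..255.
    top, left, height, width: non-negative rectangle coordinates in pixels.
    Each block x block tile inside the rectangle is replaced by its mean.
    """
    if block < 1:
        raise ValueError("block must be at least 1")
    out = [row[:] for row in image]
    bottom = min(top + height, len(image))
    right = min(left + width, len(image[0]) if image else 0)
    for y0 in range(top, bottom, block):
        for x0 in range(left, right, block):
            ys = range(y0, min(y0 + block, bottom))
            xs = range(x0, min(x0 + block, right))
            tile = [image[y][x] for y in ys for x in xs]
            mean = sum(tile) // len(tile)
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out
```

Run on the edge device before anything is stored or transmitted, a step like this combines two of the options mentioned above.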
- Let’s move on to ethics – where is the line? Should we set some boundaries? Are there any methods to deal with that?
It is all case-dependent, and there’s a thin line between striving to deliver innovations to the general public and at the same time keeping all the ethical aspects fixed. For instance, although there are certain safeguards in the recently rolled-out large language models, there are still ways to retrieve some information that should not be provided by such a model. Something similar happens with Street View in Google Maps, where faces are usually anonymized, but there may be some cases that were not caught by the anonymization algorithm, so you can access views that should not be available to the general public. Another example is the issue of copyright when using generative models, which is still rather fuzzy and unclear in many cases. Should we set boundaries? Certainly, but they should be well thought-out, and sometimes it may be difficult not to cross them, as a bit of unpredictability is embedded in ML models and we cannot always have full control over them.
- Can we say something about the importance of data for the development of Computer Vision algorithms? Is it different for classical and Machine Learning methods?
For ML methods data is the new gold, while for classical algorithms – it is as well. The difference is that in the latter case, we could develop an algorithm based on a certain reasoning and explain how it will work before it is implemented: it can be planned in our heads and then implemented. So why do we need the data in this case? To verify that the algorithm works, to have sanity checks, to implement functional and qualitative tests, and to characterize the target environment and setup-related peculiarities. Without that, it would be unreasonable to productize CV algorithms, no matter whether we use ML methods, classical algorithms, or both together. Having tests that give some performance indicators could help with the development of the processing algorithms and could be used to assess the efficiency of the whole pipeline, as well as of the individual processing steps.
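A test that yields performance indicators can be as small as the sketch below, which compares per-image detection flags against labelled data and reports precision and recall. The choice of these two indicators and the function name `precision_recall` are assumptions for illustration; the same check works whether the flags come from an ML model, a classical algorithm, or a combined pipeline.

```python
def precision_recall(predicted, expected):
    """Compare two equal-length lists of booleans (object detected or not).

    Returns (precision, recall); a ratio with an empty denominator is 0.0.
    """
    if len(predicted) != len(expected):
        raise ValueError("lists must have the same length")
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)          # true positives
    fp = sum(1 for p, e in pairs if p and not e)      # false positives
    fn = sum(1 for p, e in pairs if not p and e)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing such indicators after each processing step, and not only at the end of the pipeline, shows which stage is responsible for a drop in quality.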
- Are there any applications where machine learning cannot compete with analytical algorithms? If so – why is that?
In the simplest tasks, where the classical methods are well grounded, ML cannot outperform them, as the filters are very simple and give results that are good enough. It all depends on the required level of accuracy, the time constraints on computation, and the costs related to the development of the algorithms. If simple classical methods give results that can be processed to extract the desired information with a certain level of accuracy, it would be unjustified to put a high budget into training a more accurate ML algorithm, which could be slower, more expensive in terms of upkeep, and only slightly better (an improvement that may not even be needed).
- How do you see the importance of ML in CV or evolution in this field in the upcoming years?
I think that it will be very rapid, and the emergence of new models will revolutionize the field even more than it has so far. More accurate models, less data needed, and semi-supervised or unsupervised learning could strongly affect the importance of ML in CV. Efforts being put into Explainable AI (XAI) methods could help us to better understand why models gave specific results and could be the means to productize the models more easily and have more confidence that the results are correct. This, in turn, could be an enabler to deploy such models to work together with humans, serve as standalone decision systems, and even be implemented in safety-related applications. On the other hand, the development of hardware accelerators and the possibilities of thinning neural networks would move high volumes of processing to the edge. This, in turn, will make more data available, and short-range interconnectivity in near-me or body area wireless networks will be crucial for new applications to emerge. Finally, 3D vision systems and real-time processing of such data, especially in Mixed Reality environments, are other fields that can be revolutionized in the upcoming years, and the huge research efforts there can bring a leap beyond what we observe today.
Computer Vision Community MeetUp
This interview summarizes the discussion panel held at our Computer Vision Community MeetUp in Berlin. The event brought together some of the most talented individuals and innovators to exchange ideas, build collaborations, and expand their networks. Discussions and presentations revealed many intriguing topics and thoughts – thank you for joining us!