How can Explainable AI Help Tackle Black Box Vision in Urban Analytics?

Updated: Aug 24


by Tim Alpherts


Urban Analytics is the study of city data: analyzing or predicting it, simulating urban processes, and building tools to support this work. Many theories in the social sciences about the influence of neighbourhoods on their residents' quality of life have only been tested at a small scale, because collecting data through surveys takes so much effort. With the advances in Computer Vision over the last ten years and the abundance of streetview imagery, we are now able to predict neighbourhood characteristics using Deep Vision models.


The problem with current Deep Vision models for prediction of socio-economic indicators

People in cities tend to have a higher average socio-economic status compared to their rural counterparts, yet there are significant inequalities within cities. To map these inequalities at a high spatial resolution, we can use Computer Vision and panoramic imagery. Using panoramic imagery, researchers have been able to predict important characteristics such as the beauty or safety of a neighbourhood, health characteristics, or housing prices. While the results are interesting, the approach tends to be a black box: extremely dense and complicated models that are barely interpretable, if at all. Because of this, models that show great potential can't be used by the civil servants who would benefit from them.


What are the issues that prevent municipalities from using these models?

A model that is fit for implementation within a municipality needs to take into account who its users are, but also what its impact is. Using an intelligent system to predict neighbourhood characteristics is inherently sensitive. Not only are cities complex, but our model is meant to make decisions regarding their inhabitants. This is why, first and foremost, our models need to be explainable. By seeing the reasoning behind a model's decisions, we can take responsibility for the actions we base on them. The issue is that many explainable computer vision methods are not explainable in a practical sense. Saliency maps, for example, show where the model is looking in an image, but do not necessarily give us concrete information that can be used for policy development.
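
To make this limitation concrete, here is a minimal sketch of one simple saliency method, occlusion sensitivity, in plain Python. It is an illustration only: `toy_score` is a hypothetical stand-in for a trained model, and the 4x4 grid of numbers is a made-up "image". The idea is to mask one block of the image at a time and record how much the model's score drops.

```python
def toy_score(image):
    """Hypothetical stand-in for a trained model.

    It simply averages the brightness of the top-left 2x2 block.
    """
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Return a map of score drops caused by masking each patch x patch block."""
    height, width = len(image), len(image[0])
    base = score_fn(image)
    saliency = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            # Copy the image and blank out the current block.
            occluded = [row[:] for row in image]
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            # A large drop means the model relied on this block.
            drop = base - score_fn(occluded)
            for r in rows:
                for c in cols:
                    saliency[r][c] = drop
    return saliency


# Made-up 4x4 grayscale "image" (assumption, for illustration only).
image = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.6, 0.3, 0.1],
    [0.2, 0.1, 0.5, 0.4],
    [0.1, 0.3, 0.2, 0.6],
]
saliency = occlusion_saliency(image, toy_score)
for row in saliency:
    print(["%.2f" % value for value in row])
```

The resulting map highlights the top-left block and nothing else. That tells us where the score depends on the image, but not what is in that block or why it would matter for a neighbourhood, which is exactly the gap between a saliency map and usable policy information.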


A second important issue is ease of implementation, or rather the lack of it, in Urban Analytics models built on computer vision. Many of these models are built using Deep Vision, along with large datasets labelled by humans. If a municipality seeks to implement such a model, the requirements are high. Ideally, we would use a model that does not require intensive labelling, or even better, no labels at all. That way we rely only on existing data and keep the barrier to implementation manageable.


How can we build a model that overcomes these issues?

To solve these issues, it would be easy to avoid Computer Vision altogether, as it is dominated by complex models that offer the lay user little insight into their reasoning. However, computer vision on streetview images remains a tool with fantastic potential for measuring indicators of inequality at a high spatial resolution. This is why we look beyond these complex systems towards inherently explainable AI.


With inherently explainable AI, we look at intelligent systems that are less complex than their black box counterparts, but whose reasoning we can follow in a much more intuitive way. Instead of deploying convolutional neural networks and then trying to work out retroactively how they make decisions, we deploy lower-level AI models whose reasoning is closer to that of a human. Discovering visual elements in patches, for example, provides a much better view of what is important when looking at images from a distribution as complex as a city.


Moreover, the discovery of visual elements can be done in a weakly-supervised way, with no need for human annotation. We only need GPS coordinates, after which we can let the model discover which elements are discriminative for certain cities or neighbourhoods. Not only does this allow us to build and train the model with less human effort, it also allows for easier implementation in the places where the model will eventually go into production.
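
As a rough illustration of what "discriminative" can mean here, the sketch below scores each image patch by how many of its nearest neighbours in feature space come from the same area. This is a simplified stand-in chosen for this example, not necessarily the method used in practice. The 2-D feature vectors are hypothetical (real patch descriptors would be extracted from the images and have many more dimensions), and the area labels "A" and "B" are assumed to be derived from GPS coordinates alone.

```python
import math


def purity_scores(features, areas, k=3):
    """For each patch, the share of its k nearest neighbours from the same area."""
    scores = []
    for i, (feat, area) in enumerate(zip(features, areas)):
        # Distance from this patch to every other patch, nearest first.
        distances = sorted(
            (math.dist(feat, other), j)
            for j, other in enumerate(features)
            if j != i
        )
        neighbours = [j for _, j in distances[:k]]
        same_area = sum(1 for j in neighbours if areas[j] == area)
        scores.append(same_area / k)
    return scores


# Hypothetical patch descriptors; the only label is the area from GPS.
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # look typical of area A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # look typical of area B
    (2.5, 2.5), (2.6, 2.5), (2.5, 2.6), (2.6, 2.6),  # generic, seen in both
]
areas = ["A", "A", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A"]

scores = purity_scores(features, areas)
for feat, area, score in zip(features, areas, scores):
    print(area, feat, round(score, 2))
```

Patches that score close to 1 look alike and occur almost only in one area, so they are candidates for visual elements that set that area apart. Patches with low scores look the same everywhere. Because the output is a set of actual image patches, a civil servant can inspect them directly.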


Conclusion

To tackle inequality within cities, it is important to be able to quickly find predictors of a shift in socio-economic status among the population. Explainable computer vision allows us not only to find these predictors, but also to deploy the model in a practical setting, as it is designed with direct deployment in mind.