This is harder than you might expect because it's hard to tell whether a passing test is a false positive (i.e. the test passed, but it should have failed).

It's also hard to tell the testing system how much change in the UI is acceptable: what the testing system thinks is OK, you might consider broken.
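To make the threshold problem concrete, here is a minimal sketch, assuming screenshots are reduced to grids of grayscale pixel values and the test passes when the fraction of changed pixels stays under a limit. The function names and the `max_changed_fraction` parameter are made up for illustration and are not taken from any particular tool.

```python
def changed_fraction(baseline, candidate):
    """Return the fraction of pixels that differ between two screenshots.

    Both screenshots are lists of rows of grayscale values (an assumed,
    simplified representation of an image).
    """
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if a != b
    )
    return changed / total


def ui_test_passes(baseline, candidate, max_changed_fraction):
    """Hypothetical pass/fail rule: tolerate up to max_changed_fraction of changed pixels."""
    return changed_fraction(baseline, candidate) <= max_changed_fraction


# A 10x10 all-white baseline, and a candidate where 5 pixels turned black
# (say, a small label that disappeared).
baseline = [[255] * 10 for _ in range(10)]
candidate = [row[:] for row in baseline]
candidate[0][0:5] = [0] * 5

lenient = ui_test_passes(baseline, candidate, max_changed_fraction=0.10)
strict = ui_test_passes(baseline, candidate, max_changed_fraction=0.01)
```

The same screenshot pair passes under the lenient limit and fails under the strict one, and neither number tells you whether those five pixels were an important button label or harmless anti-aliasing noise.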

There are quite a few companies out there trying to solve this problem, including my previous employer, Rainforest QA: https://rainforestqa.com