Gaze as Context: a foray into intuitive interaction

As we here at Mirametrix have officially detached from the mother ship that is TandemLaunch (though we’re still stuck in their office – shhhh ), it seemed like a perfectly appropriate time to talk about our views on human-computer interaction and how we see it evolving. While outwardly a standard eye tracking research company, we’re committed to aiding the push towards a future where device interaction is both simpler and more intuitive.

Minority Report’s Futuristic Human-Computer Interface

Now, if one were to base this ideal future on recent sci-fi films, body-pose-based gesture interaction would be the name of the game. Although the interface demonstrated in Minority Report certainly had the strongest effect on our general perceptions of futuristic interfaces ([1] is a great read on the subject), these types of interfaces have appeared in literature and film for decades. And with good reason: for a large number of applications, gestures can directly mimic how we would interact with things in the physical world. They look darn cool, to boot. However, there are some things gesture is inherently unsuited to handle: abstract interactions, which lack direct physical counterparts, are particularly difficult to model via gesture. The state of current gesture interfaces brings to light another major hurdle: while fairly good at understanding a user’s pose and gestures, these devices have difficulty determining what the user is attempting to interact with in the virtual realm. This is most apparent in the interaction schemes for menus or dashboards, such as the one shown below:

XBox One Dashboard Interface

The intuitive method of interacting with such an interface would be to point at or grab the item you wish to select. However, this is not feasible given the constraints of gesture technology. The most common interaction scheme mimics the standard pointer: you perform a gesture to summon a cursor, move it to the correct object, and perform a second gesture to select. This is a step away from the direct interaction people are trying to achieve. By adding this layer of abstraction, we in many cases remove the “naturalness” of the interaction, reminding us that we’re still dealing with a mouse-click style user interface.

Similarly, despite major strides, current voice systems are for many use cases over-engineered. The vast majority of these systems accept any phrase as input, which introduces some problems. First, it adds complexity to the necessary language processing: the system must reliably and efficiently parse the syntax of the given phrase, understand the meaning behind the potential command, and determine whether it applies to the system it’s running on. Keep in mind, this assumes the system has already accurately converted the audio feed to words, a difficult problem in and of itself given the vastly different accents within a given language. Second, such voice recognition systems must by their very nature restrict the user to fully explicit commands, removing the deictic phrases we tend to use in normal conversation. Even with the voice component of the new Kinect, which has smartly been trained on a set command list [2] (reducing the scope of the problem and increasing its accuracy), this lack of deictic commands is non-ideal. In the sea of available items on the dashboard, sometimes the most natural command we want understood is “Select that”.
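The fixed-command approach can be sketched in a few lines. To be clear, everything here is illustrative: the phrases and handler names are hypothetical, not Kinect’s actual grammar.

```python
# A minimal sketch of a fixed-command voice interface, assuming the
# speech-to-text step has already produced a phrase. The command
# vocabulary below is made up for illustration.
COMMANDS = {
    "go home": "show_dashboard",
    "play": "resume_playback",
    "pause": "pause_playback",
    "select that": "select_focused_item",  # deictic: needs outside context
}

def interpret(recognized_text):
    """Map a recognized phrase onto a known command, or reject it."""
    phrase = recognized_text.strip().lower()
    return COMMANDS.get(phrase)  # None means "not a recognized command"
```

Note how cheaply this rejects out-of-grammar input compared to full natural-language parsing, and note that “select that” still resolves to nothing useful without some other modality telling the system what “that” is.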

Now, this problem of context is not exclusive to voice, but inherent in the vast majority of these new interaction devices we’re seeing on the market. From Kinect to Siri to Leap Motion, great products are being released, but they’re being held back by the isolationist nature of their ecosystems. And the solution to this problem is by no means new: since the advent of these new modalities, there has been interest in roping them together into multi-modal systems. Think about the last interaction you had with a friend; how often did you refer to contextual things (this, that), point at objects, or indicate what you were talking about by looking at it? We interact innately via multiple modalities (talking, looking, pointing, gesturing all at once), so it only seems natural to want our computers to be able to decipher these multifaceted commands.

To slide back to how Mirametrix fits in here: we’re in the gaze tracking business because we think it’s an underlying key to gluing these modalities together. When referring to things in visual space, we naturally fixate on what we are referring to. This simple key, context, can immensely simplify the recognition problem in a number of scenarios. In the case of gesture mentioned earlier, this context can reduce the gesture tracking problem from one of perfect 3D pose tracking to one of distinguishing among a small set of gestures. Similarly, the voice recognition problem becomes one of differentiating between small sets of synonyms in the case described above. This idea of context as a unifying/simplifying component is quite powerful: it not only determines what we’re referring to during interaction (connecting separate modality actions), but also allows the connected modalities to be simpler and more lightweight, by virtue of this contextual association. It is via this simple tenet that we hope to make interchanges between us and computers truly natural.
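As a rough illustration of gaze as context, here is a minimal sketch of how a fixation point could resolve a deictic command like “select that”. The dashboard items, their coordinates, and the naive rectangle hit-test are all invented for this example; nothing here reflects Mirametrix’s actual API.

```python
# Hypothetical dashboard layout: item name -> (x, y, width, height).
ITEMS = {
    "movies_tile": (0, 0, 200, 150),
    "music_tile":  (210, 0, 200, 150),
    "games_tile":  (0, 160, 200, 150),
}

def item_under_gaze(gaze_x, gaze_y):
    """Return the dashboard item the user is fixating on, if any."""
    for name, (x, y, w, h) in ITEMS.items():
        if x <= gaze_x < x + w and y <= gaze_y < y + h:
            return name
    return None

def handle_command(command, gaze_x, gaze_y):
    """Fuse a recognized voice command with the current gaze context."""
    if command == "select that":
        target = item_under_gaze(gaze_x, gaze_y)
        return f"select:{target}" if target else "error:no-target"
    return f"run:{command}"
```

The point of the sketch is the division of labour: the voice system only needs to recognize a tiny phrase, and the gaze system only needs a fixation point, yet together they resolve a reference neither could handle alone.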



How to Spot Technology Trends II: Replacing ‘Impossible’ with ‘Never Before Possible’

This is part 2 of a post on spotting technology trends.  In the last post I explained that the best place to look for technology trends is my mother’s demographic (a mainstream demographic, with disposable income, that is not very technologically savvy). In this post, I describe the mindset required to do that.

My answer whenever my mother makes some unrealistic demand for realism or quality is normally to explain that what she wants technology to do is not possible (mainly for my peace of mind).  In the back of my mind, though, I’m always careful to recognize that what is not currently possible is entirely different from what is impossible.  The development history of dynamic LED TV is a perfect example of how technological development can be crippled by the confusion of these two very different ideas.

In the early days of what is now called LED TV, or local dimming TV, we knew that people love sparkly stuff.  It’s nothing new: people like diamonds, crystal chandeliers, crystal glasses, gold, silver, chrome and other shiny materials.  We even polish our motorcycles and cars.  God knows why. Maybe it’s due to some evolutionary advantage in spotting the eyes of predators; whatever the reason, we are attracted to things that shine.  The problem with displays at the time was that they couldn’t show sparkly stuff.   Displays were limited to a range of brightness that made it impossible for them to show sparkles (8 bits per channel, so 256 levels from black to white); they just couldn’t produce enough bright light to give the impression of a reflection. After a few decades of living with these limited displays, the bulk of the TV industry concluded:  “Making this better is hard, so 8-bit is the end of the line, and we just accept this constraint.”

 A whole community – an entire industry – convinced itself that, ‘8-bit is enough.’

A whole field of science sprang up to argue that 8 bits was all that human beings could see (often through industry-sponsored research).  When you pointed out that in the real world the eye can see more (I can see shadows under my desk and, simultaneously, bright buildings through the window), they would just respond that your brain is fooling you, that sparkles don’t really matter, that you don’t consciously perceive more than 8 bits, or that it’s simply not that important. The Emperor had beautiful clothes…

The consequence was that all development of display technologies that would go beyond 8 bits ground to a halt. Industry, investors, and scientists collectively said no to a whole realm of possible innovation. At BrightSide I remember pitching to a venture capital fund that had hired an “8-bit is enough” expert. Even while looking right at our display and physically seeing that 8 bits is not the limit of human vision, they stuck to the line that the technology wasn’t needed.

A much smaller camp of rebels admitted that 8 bits might not be enough, but decided to focus their energy on dealing with the problem instead of solving it (clearly, no one was encouraging them to solve the root problem).  So they developed so-called tone mapping algorithms and other image processing techniques that basically allow you to squish a high-contrast real-world scene into the limited range of an 8-bit display.  At least they were acknowledging the problem, but they were unfortunately focusing all of their energy on a problem rooted in an arbitrary technical constraint.
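For the curious, a global tone mapping curve of the kind alluded to above can be sketched in a few lines. This uses the simple compression curve L/(1+L) (the classic global operator); the "key" and average-luminance values below are arbitrary illustrative choices, and real tone-mapping pipelines are considerably more elaborate:

```python
# A minimal sketch of global tone mapping: squeezing real-world
# luminance (which spans many orders of magnitude) into an 8-bit
# display range. Parameter values are illustrative assumptions.

def tone_map(luminance_cd_m2, key=0.18, average=50.0):
    """Compress an absolute scene luminance into a 0-255 pixel value."""
    scaled = key * luminance_cd_m2 / average   # normalize to the scene average
    compressed = scaled / (1.0 + scaled)       # maps [0, inf) into [0, 1)
    return round(compressed * 255)

dim = tone_map(5.0)        # shadow under the desk: survives, barely
bright = tone_map(10000.0) # specular sparkle: fits, but heavily crushed
```

Both the shadow and the sparkle land inside the 8-bit range, which is exactly the trick, and exactly the problem: a luminance ratio of 2000:1 in the scene has been flattened into a pixel ratio of a few dozen to one, so the sparkle no longer looks like a sparkle.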

That’s right, an arbitrary constraint. Because deep down we all know where the 8-bit magic comes from. Somebody in the early days of computing decided that 8 bits would be a nice unit for microprocessors, later CPUs, and ultimately for the operating systems that ran on those CPUs. With the advent of digital displays in the 90s it was simply convenient to use the same integrated circuits, processors, and software on the display side as well. It would be an incredible coincidence if the human visual system just happened to have the same constraints as an arbitrarily chosen electronics requirement driven by supply-chain convenience!

It took us years at BrightSide to break through these self-created mental barriers and push dynamic LED TV as a higher-contrast, higher-bit-depth solution. Today, LED TV accounts for some 30-40% of the global display market, and 8-bit displays are quickly disappearing from the market entirely.

Let’s come back to innovation. The problem in this example wasn’t that people were stupid or unwilling to listen. It was that a myth had spread to the point that it had become part of the fabric of that field. The vast majority of display researchers, engineers and marketers didn’t know enough about visual psychophysics to invalidate the myth in their minds, so they perpetuated it as gospel instead. Such people have difficulty articulating the true problem statement(s) in their field and thus have no hope of developing anything other than incremental improvements.

Major disruptive innovation comes from the recognition of a major problem. Inspiration for that is easier to find with those who imagine the world as it should be, rather than those who “know” how it is.

How to Spot Technology Trends I: Mother Knows Best

Spotting technology trends is a subject that is at the forefront of our minds at TandemLaunch.  While I can’t offer any silver bullet solutions, this will be the first of a two-post series on the subject.   I look forward to some good discussion on the topic.

When you build a product or design a technology that will actually become a product, there are really only two business models.  There is B2B (business to business), where you try to make an operational aspect of your customer’s business more efficient, and B2C (business to consumer), where you ultimately try to create great products that enable human beings to do what they like to do.  I will focus on B2C products.

The question, of course, is how to find out what people like to do.  The answer is to talk to my mother.  What I mean is that you look at a demographic in the mainstream, one that is not a technological leader and that has no patience at all for technology hype.  In fact, a demographic that generally does not even understand the hype.  If you are in my age group, that demographic is your parents.  It might sound insane, but I look at my mom as a predictor of the technological future because she barely knows how to use a computer, she certainly couldn’t use the Commodore 64 that I started writing simple lines of code on, and her knowledge of computers has not improved dramatically in the last 20 years.  Here’s the thing, though: my mom has a long list of things that she would like to do, that technology could enable, and she would clearly buy these products if they were available.

To all of you eager and aggressive entrepreneurs in the tech space, it may seem silly to look at a past generation to define what the future might be, so let’s look at an example. My mother lives a continent away and wants to see her grandchildren. ‘No problem,’ says technically savvy me.  We end up buying a TV with a built-in camera, get Skype set up, and when I’m all pleased with the setup I’ve created, my mom points out that the image is not so great.  I stare blankly at the image, trying to diagnose the problem and figure out what has gone wrong on her side. Ultimately, I realize that her side looks just like my side.  But my side looks fine. Why?

On my side, I’ve mentally ignored a whole bunch of compression artifacts, small distortions and other little things that somebody who is savvy in the video processing world would just discard and filter out.  My mom doesn’t.  My mom wants to see a picture of her grandchildren that is as clear as a photograph, or even better, as clear as when they are right in front of her.  Over time I have learned that my mom and people like her make all kinds of ‘unreasonable requests’ for realism, quality, or systems that they can operate like the objects they are already used to.  A lot of the time, when people don’t have an intuitive understanding of what is technically feasible, they simply articulate what they want to do based on how they do it from one human to another.  If you were to see my children in person, they would look crystal clear, in full colour, and in three dimensions. Of course this is what my mother wants.  Unless somebody tells her that it is not possible, she will continue to ask for it.  And if someone were to make it possible, she would spend her disposable income on it.

Furthermore, the economic future of most of our major Web 2.0 and current technology successes is based on people like my mom.  Facebook was put on the map by college kids who were into this ‘new social media thing,’ but if Facebook is ever going to have any hope of economic success, of truly monetizing and having a long life, then it has got to tap into the average user: the 43-to-50-year-old woman who plays Farmville all day and has disposable income to spend in the system.  These people are the actual monetizing users.  In Facebook’s case they are both the money driver and, in some sense, the product (their information gets sold to advertisers and makes even more money).

Like my mom, this demographic not only has a profound economic impact on future business; it is also one of the best indicators of how the future will look, because its members don’t have the same mental limits that people who are knowledgeable about technology intrinsically have.  They easily articulate what they want, and they don’t let the ‘technically possible’ limit what they ask for.