This project has been published in the paper: Exploration of Foot-based Text Entry Techniques for Virtual Reality Environments
Virtual Reality (VR) provides immersive experiences that have fundamentally transformed the way humans interact with digital information. As VR becomes increasingly widespread, the demand for efficient text input methods continues to grow. Traditional text input techniques in VR, such as handheld controllers and hand gestures, become impractical when users’ hands are occupied or disabled. Although existing research has explored hands-free alternatives such as voice recognition and eye-gaze interaction, these methods are often hindered by ambient noise interference and calibration challenges. To address these limitations, we propose a novel hands-free text input technique based on foot interaction.
We first conducted a preliminary study to evaluate the feasibility of tap and swipe input approaches using foot-based interaction in both standing and sitting positions. Three foot-based techniques (two tap-based methods and one swipe-based method) were then developed to improve the system’s performance and usability. We also designed an arched QWERTY keyboard with an ergonomic layout tailored to the natural movement trajectories of the feet and legs, thereby improving user comfort. These techniques were evaluated in a subsequent user study, which revealed entry rates of 11.12 WPM and 10.80 WPM for the tap-based techniques and 9.16 WPM for the swipe-based technique.
We used an HTC Vive Pro 2 for this experiment. It had a dual RGB low-persistence LCD display with a resolution of 2448 × 2448 pixels per eye and a 120 Hz refresh rate. It was connected to a Windows 10 Pro PC with an Intel i9-11900 CPU and an Nvidia GeForce RTX 3090 GPU. The techniques and virtual environment were implemented in Unity3D (v2021.3.1f1) with the SteamVR Unity plugin (v2.7.3) and an HTC Vive Tracker 2.0. We chose a comparatively low-cost setup to ensure the wider applicability of our findings to devices available on the current market.
Users’ reluctance to learn new keyboard layouts, owing to the difficulty of adapting to unfamiliar configurations, has contributed to the widespread adoption of the QWERTY layout in VR virtual keyboards. We therefore used the QWERTY layout to minimize learning costs.
In the VR environment, we repositioned the keyboard interface from beneath the user’s feet to their direct line of sight. This modification aims to enhance user comfort and efficiency by eliminating the need for frequent head rotations to view a keyboard located under the feet. The interface within the VR Head-Mounted Display (HMD) consists of a text display area and a virtual keyboard. The text display area presents both the transcribed sentences and the text input entered by users.
The virtual keyboard has the following specifications:
This study included two experiments, both employing within-subjects designs. In Study 1, the independent variables were Technique and Posture, resulting in four conditions. In Study 2, the independent variable was Technique, with three conditions. The order of conditions in both studies was counterbalanced using a Latin-Square approach to minimize order effects.
For both experiments, participants transcribed 12 sentences per condition, sourced from MacKenzie and Soukoreff’s phrase set. The first two sentences were designated for training and were not analyzed, while the remaining ten were recorded for analysis. Participants completed a consent form and a demographic questionnaire and received an introduction to the tasks and the VR setup. During each session, participants were asked to type as quickly and accurately as possible while seated in stationary chairs to avoid movement constraints. After each session, they completed post-task questionnaires, and semi-structured interviews were conducted at the end to gather feedback. Five-minute breaks were provided between sessions, and each experiment lasted approximately 60 minutes.
We implemented the interaction by tracking the position and orientation of the HTC Vive Tracker. The process involves the following steps. First, we map the user’s physical foot input area in the real world to the virtual keyboard in VR. The virtual keyboard is positioned 10 meters in front of the user (i.e., $z = -10\,\text{m}$), while the physical input area lies on the ground plane, so the mapping is essentially a 2D coordinate transformation between two planes.
1. Length Mapping Before performing coordinate transformation, we account for the scale differences between the physical and virtual worlds. To ensure the physical foot input area matches the proportions of the virtual keyboard, we introduce a scaling factor $s$ to adjust the physical coordinates. Specifically, we take reference points $(x_1, y_1)$ and $(x_2, y_2)$ from the physical input area, as well as $(x_3, y_3)$ and $(x_4, y_4)$ from the virtual keyboard, and compute the scaling factor $s$ using the formula:
\[s = \frac{L_{\text{virtual}}}{L_{\text{real}}}\]
where
\[L_{\text{real}} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}, \qquad L_{\text{virtual}} = \sqrt{(x_4 - x_3)^2 + (y_4 - y_3)^2}.\]
The scaling factor $s$ ensures that the physical input area is scaled proportionally to the virtual keyboard.
2. Rotation Angle Calculation To compute the rotation angle $\alpha$, we use the angle formula between the direction vectors of the physical input area and the virtual keyboard. Let the direction vector of the physical input area be $\mathbf{v}_1 = (x_2 - x_1, y_2 - y_1)$ and the virtual keyboard’s direction vector be $\mathbf{v}_2 = (x_4 - x_3, y_4 - y_3)$. The formulas for $\alpha$ are:
\[\cos(\alpha) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \cdot \|\mathbf{v}_2\|}, \quad \sin(\alpha) = \frac{\mathbf{v}_1 \times \mathbf{v}_2}{\|\mathbf{v}_1\| \cdot \|\mathbf{v}_2\|}\]
where $\mathbf{v}_1 \cdot \mathbf{v}_2$ is the dot product and $\mathbf{v}_1 \times \mathbf{v}_2$ is the scalar (2D) cross product of the two direction vectors.
3. Coordinate Transformation Next, we apply a combination of scaling, rotation, and translation to map the physical coordinates $(x, y)$ to the virtual-space coordinates $(x', y')$. The transformation formulas are as follows:
\[x' = s \cdot \big[ (x - x_1) \cos(\alpha) - (y - y_1) \sin(\alpha) \big] + x_3\]
\[y' = s \cdot \big[ (x - x_1) \sin(\alpha) + (y - y_1) \cos(\alpha) \big] + y_3\]
where $(x_1, y_1)$ is the reference point of the physical input area, $(x_3, y_3)$ is the corresponding reference point on the virtual keyboard, $s$ is the scaling factor, and $\alpha$ is the rotation angle. A minimal code sketch of this mapping is given below.
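The following is a minimal Python sketch of this two-step mapping (deriving the scale and rotation from the reference points, then applying them together with the translation). The actual system was implemented in Unity3D/C#; the names used here (`compute_mapping`, `map_point`, `p1`, `q1`, etc.) are illustrative only.

```python
import math

def compute_mapping(p1, p2, q1, q2):
    """Derive the scaling factor s and rotation angle alpha that map the physical
    reference segment p1 -> p2 onto the virtual segment q1 -> q2.
    p1, p2 correspond to (x1, y1), (x2, y2); q1, q2 to (x3, y3), (x4, y4)."""
    v1 = (p2[0] - p1[0], p2[1] - p1[1])    # direction vector of the physical input area
    v2 = (q2[0] - q1[0], q2[1] - q1[1])    # direction vector of the virtual keyboard
    l_real = math.hypot(v1[0], v1[1])
    l_virtual = math.hypot(v2[0], v2[1])
    s = l_virtual / l_real                 # scaling factor

    dot = v1[0] * v2[0] + v1[1] * v2[1]    # v1 . v2
    cross = v1[0] * v2[1] - v1[1] * v2[0]  # scalar 2D cross product
    cos_a = dot / (l_real * l_virtual)
    sin_a = cross / (l_real * l_virtual)
    return s, cos_a, sin_a

def map_point(p, p1, q1, s, cos_a, sin_a):
    """Map a physical foot position p = (x, y) to virtual keyboard coordinates (x', y')."""
    dx, dy = p[0] - p1[0], p[1] - p1[1]
    x_virtual = s * (dx * cos_a - dy * sin_a) + q1[0]
    y_virtual = s * (dx * sin_a + dy * cos_a) + q1[1]
    return x_virtual, y_virtual
```

In practice, `compute_mapping` would be run once during calibration from the measured reference points, and `map_point` would be applied to every tracker sample thereafter.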
For the selection function, an intentional toe tap is recognized when the upward lift of the toes exceeds 10 degrees. This threshold differentiates intentional selection taps from natural foot-tip elevation (a minimal detection sketch is given below). The tap-based text entry method handles input noise and predicts the intended words with a statistical decoding algorithm.
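A minimal sketch of this threshold-based tap detection, assuming the tracker's pitch angle (toe lift in degrees) is sampled once per frame; the hysteresis value `RESET_THRESHOLD_DEG` and the `ToeTapDetector` class are illustrative assumptions rather than part of the published implementation.

```python
TAP_THRESHOLD_DEG = 10.0    # toe lift that counts as an intentional tap (from the text)
RESET_THRESHOLD_DEG = 5.0   # angle below which the foot counts as flat again (assumed hysteresis)

class ToeTapDetector:
    """Detects intentional toe taps from the tracker's pitch angle."""

    def __init__(self):
        self.lifted = False

    def update(self, pitch_deg: float) -> bool:
        """Call once per frame; returns True exactly once per tap."""
        if not self.lifted and pitch_deg > TAP_THRESHOLD_DEG:
            self.lifted = True
            return True                    # register a selection
        if self.lifted and pitch_deg < RESET_THRESHOLD_DEG:
            self.lifted = False            # foot is flat again; ready for the next tap
        return False
```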
The basic implementation involves the following steps; a code sketch is given after the list:
1. Input Modeling
As the user enters each character, record the current input sequence $C = c_1 c_2 \ldots c_n$. We model the likelihood $P(C|W)$ of the input given a candidate word $W = w_1 w_2 \ldots w_n$ as a product of per-character probabilities:
\[P(C|W) = \prod_{i=1}^{n} P(c_i \mid w_i)\]
To handle tapping errors and input noise, each $P(c_i \mid w_i)$ is given by a Gaussian noise model centered on the key of $w_i$.
2. Pruning
To reduce the complexity of word prediction, we prune words from the lexicon whose initial letter does not match that of the input sequence $C$: for an observed input sequence starting with $c_1$, only words $W$ whose first letter $w_1$ satisfies $w_1 = c_1$ are retained.
3. Word Prediction: For each word $W$ in the pruned lexicon, calculate the posterior probability $P(W|C)$ using the following formula:
\[P(W|C) = \frac{P(C|W) \cdot P(W)}{\sum_{W'} P(C|W') \cdot P(W')}\]
Then select the top five words $W^*$ with the highest posterior probabilities and return them to the user interface (UI):
\[W^* = \arg\max_{W} P(C|W) \cdot P(W)\]
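The Python sketch below ties the three steps together. It assumes a `lexicon` dictionary mapping words to prior probabilities $P(W)$ and a `key_centers` dictionary mapping letters to key-center coordinates; the Gaussian standard deviation `SIGMA` and the same-length constraint in the pruning step are assumptions made for illustration.

```python
import math

SIGMA = 0.5  # std. deviation of the Gaussian noise model, in key widths (assumed value)

def gaussian_likelihood(point, key_center, sigma=SIGMA):
    """P(c_i | w_i): isotropic 2D Gaussian of the tap point around the key center."""
    dx, dy = point[0] - key_center[0], point[1] - key_center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)

def nearest_key(point, key_centers):
    """Letter whose key center is closest to the given tap point."""
    return min(key_centers, key=lambda k: math.dist(point, key_centers[k]))

def decode_taps(taps, lexicon, key_centers, top_n=5):
    """Return the top-N candidate words for a sequence of tap points C."""
    first_letter = nearest_key(taps[0], key_centers)
    scored = []
    for word, prior in lexicon.items():
        # Pruning: keep only candidates whose first letter matches c_1
        # (also matching the input length is an extra simplification here).
        if len(word) != len(taps) or word[0] != first_letter:
            continue
        # Input modeling: P(C|W) as a product of per-character Gaussian likelihoods.
        likelihood = 1.0
        for tap, letter in zip(taps, word):
            likelihood *= gaussian_likelihood(tap, key_centers[letter])
        scored.append((likelihood * prior, word))  # unnormalized P(C|W) * P(W)
    # Word prediction: the normalizing constant is shared by all candidates,
    # so ranking by P(C|W) * P(W) is equivalent to ranking by the posterior P(W|C).
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]
```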
1. Lexicon Representation
First, we converted each word in the lexicon (mentioned above) into a path. Each word is represented as a sequence of line segments connecting the center points of the corresponding letters on the virtual keyboard.
For example, the word “cat” is represented as a path connecting the centers of the letters “c”, “a”, and “t”.
This representation allows the algorithm to model the spatial structure of words for gesture-based decoding.
2. Input Preprocessing
Once a gesture begins, we record it as a trajectory consisting of a series of points.
The trajectory input is delimited by the user’s foot taps, which define two critical points: the start and end points, as shown in the figure.
Rather than using the raw coordinates, these two points are mapped to the center positions of the corresponding keys.
These mapped key centers are then used to prune the lexicon by filtering out words whose corresponding paths have start or end points that are too far (e.g., more than one key width) from those of the gesture, which reduces the search space.
3. Gesture Path Matching: For each candidate word, we compute the similarity between the word’s path and the gesture trajectory:
The shape score is computed from the pointwise distances between the sampled points of the gesture and the template. To enable comparison, we first normalize the gesture and the templates to a consistent size (L = 200). These distances are summed to produce the final shape score (the lower, the better).
The location score is computed from the pairwise Euclidean distances between the points of the gesture and the template. A weighting factor ($\alpha$) is applied to emphasize the central points of the gesture.
Finally, we combine the shape and location scores into a single integration score; a code sketch of the full matching pipeline is given below.
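The sketch below puts the three steps together in Python, assuming a `key_centers` dictionary of letter positions in key-width units. It uses SHARK2-style resampling and scoring; the resampling routine, the size normalization, the exact $\alpha$ weighting function, and the 50/50 combination weights are illustrative assumptions rather than the published parameters.

```python
import math

L = 200  # number of resampled points per gesture/template (from the text)

def word_to_path(word, key_centers):
    """Step 1: represent a word as the sequence of its letters' key centers."""
    return [key_centers[letter] for letter in word]

def resample(points, n=L):
    """Resample a polyline to n equidistant points (classic $1-recognizer resampling)."""
    pts = [tuple(p) for p in points]
    total = sum(math.dist(pts[i - 1], pts[i]) for i in range(1, len(pts)))
    if total == 0:
        return [pts[0]] * n
    interval = total / (n - 1)
    out, acc, i = [pts[0]], 0.0, 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if acc + d >= interval:
            t = (interval - acc) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)  # continue measuring from the newly created point
            acc = 0.0
        else:
            acc += d
        i += 1
    while len(out) < n:       # guard against floating-point shortfall
        out.append(pts[-1])
    return out[:n]

def normalize(points, size=200.0):
    """Scale the larger bounding-box side to `size` and center on the centroid."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    span = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    s = size / span
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    return [((x - cx) * s, (y - cy) * s) for x, y in points]

def shape_score(gesture_pts, template_pts):
    """Sum of pointwise distances after size normalization (lower is better)."""
    g, t = normalize(gesture_pts), normalize(template_pts)
    return sum(math.dist(p, q) for p, q in zip(g, t))

def location_score(gesture_pts, template_pts, alphas):
    """Alpha-weighted sum of pointwise distances in keyboard coordinates."""
    return sum(a * math.dist(p, q) for a, p, q in zip(alphas, gesture_pts, template_pts))

def decode_gesture(gesture, lexicon, key_centers, key_width=1.0, top_n=5):
    """Steps 2-3: prune by start/end keys, then rank candidate words by integration score."""
    # Weighting that emphasizes the central points (the exact function is assumed).
    alphas = [1.0 - 0.5 * abs(2 * i / (L - 1) - 1.0) for i in range(L)]
    g = resample(gesture)
    start = min(key_centers.values(), key=lambda c: math.dist(gesture[0], c))
    end = min(key_centers.values(), key=lambda c: math.dist(gesture[-1], c))
    scored = []
    for word in lexicon:
        path = word_to_path(word, key_centers)
        # Pruning: skip words whose path starts/ends too far from the mapped keys.
        if math.dist(path[0], start) > key_width or math.dist(path[-1], end) > key_width:
            continue
        t = resample(path)
        score = 0.5 * shape_score(g, t) + 0.5 * location_score(g, t, alphas)  # integration score
        scored.append((score, word))
    scored.sort()                     # lower integration score = better match
    return [w for _, w in scored[:top_n]]
```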
Text Entry Performance
Entry Rate
Total Error Rates:
Both FootTap and FootGesture achieved acceptable entry rates (over 7 WPM sitting, over 6 WPM standing) and low error rates (below 0.1%). However, entry rates were slower in the standing posture (6.73 WPM for FootTap, 6.67 WPM for FootGesture).
Usability and Perceived Workload
ASQ and PSSUQ Scores
Workload (NASA-TLX)
ASQ and NASA-TLX scores indicate that participants preferred typing in the sitting posture, where physical demands and overall workload were lower.
Interview
Entry Rate and Error Rate
Entry Rate:
Error Rates:
Usability (PSSUQ Scores)
Perceived Workload (NASA-TLX)
RM-ANOVAs found significant effects of Technique on: