Chapter 5
Practical considerations
We have described our proposed framework for action recognition, as well as two
examples of localized features. Several practical issues still need to be addressed
before the framework can be applied to real video (Guo et al., 2010b).
5.1 Processing continuous video
In practice, we usually need to recognize actions in a continuous video. With limited
memory, however, it is sometimes impossible to process the whole video at once. It
is therefore necessary to partition the video into segments (Fig. 5·1) and to process
one segment at a time. Using video segments can also enhance the robustness of our
action recognition approach. A query video that is not partitioned into segments
yields only one feature log-covariance matrix; if that matrix is misclassified, there
is no chance of obtaining the correct label for the query action. In contrast, if the
query video is divided into segments, it can still be correctly classified even if a
few segments are misclassified.
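The segment-at-a-time processing described above can be sketched as a generator that holds only one segment in memory. This is a minimal illustration, not the implementation used in this work: the function name `iter_segments`, the parameter `segment_len`, and the choice to drop a short trailing segment are all assumptions.

```python
from itertools import islice


def iter_segments(frames, segment_len):
    """Yield consecutive, non-overlapping segments of `segment_len` frames.

    `frames` can be any iterable (for example, a video reader), so only
    one segment is held in memory at a time. A trailing segment shorter
    than `segment_len` is dropped (an assumption of this sketch).
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    it = iter(frames)
    while True:
        segment = list(islice(it, segment_len))
        if len(segment) < segment_len:
            return
        yield segment


# Integers stand in for frames in this toy example.
segments = list(iter_segments(range(10), 4))
```

Each yielded segment can be converted to its feature representation and classified before the next one is read, so memory use depends on the segment length rather than on the video length.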
5.2 Length of segments
Many actions, such as walking, running, and jumping, are roughly repetitive (Fig. 5·2).
Therefore, for repetitive actions, an appropriate segment length is the approximate
number of frames in an action period. The typical period for many human actions
is on the order of 0.4–0.8 seconds (except for very fast or very slow motion). For a
camera operating at 25 fps, the typical length of an action segment is 10–20 frames.
Figure 5·1: Illustration of video segments.
Figure 5·2: Illustration of the repetitive nature of the action “walking”.
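The segment length in this section follows from multiplying the action period by the frame rate. A small sketch of that arithmetic, with the helper name `segment_length` chosen here for illustration only:

```python
def segment_length(period_s, fps):
    """Approximate number of frames in one action period.

    period_s: duration of one action period in seconds.
    fps: camera frame rate in frames per second.
    """
    return round(period_s * fps)


# Typical action periods of 0.4 s and 0.8 s at 25 fps.
shortest = segment_length(0.4, 25)
longest = segment_length(0.8, 25)
```

For other frame rates the same periods give different segment lengths, so the segment length should be chosen per camera rather than fixed in frames.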
5.3 Temporal misalignment
Suppose there are two videos A and B that include the same action class. Both of
them are partitioned into video segments. It is possible that a video segment from A
cannot find a good match in video B, due to temporal misalignment. This motivates
the need to break a video sequence into successive overlapping action segments, as
shown in Fig. 5·3. With overlapping segments, the actions in segments from different
videos can be better synchronized, so well-matched segments are more likely to be
found. Overlapping video segments also enrich the training set, which can make
action classification more reliable.
Figure 5·3: Illustration of overlapping segments.
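One way to generate successive overlapping segments is a sliding window over frame indices. The sketch below is illustrative: the text does not specify the amount of overlap, so the `step` parameter (the shift between consecutive segment starts) and the function name `overlapping_segments` are assumptions.

```python
def overlapping_segments(num_frames, segment_len, step):
    """Return (start, end) frame-index pairs with `end` exclusive.

    Consecutive segments start `step` frames apart, so they overlap by
    `segment_len - step` frames whenever `step < segment_len`.
    Segments that would run past the end of the video are not emitted.
    """
    if segment_len <= 0 or step <= 0:
        raise ValueError("segment_len and step must be positive")
    last_start = num_frames - segment_len
    return [(s, s + segment_len) for s in range(0, last_start + 1, step)]


# A 50-frame video, 20-frame segments, shifted by 5 frames each time.
bounds = overlapping_segments(num_frames=50, segment_len=20, step=5)
```

A smaller `step` gives finer temporal alignment and more training segments, at the cost of more segments to represent and classify.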
5.4 Majority rule
After partitioning a query video into overlapping segments, we can apply our action
recognition approach to each video segment and obtain a sequence of action labels.
Assuming that each query video involves only one action class, the segment-level
action labels can be fused into a sequence-level decision using the majority rule,
which takes the most frequent segment label as the label of the sequence. An example
is shown in Fig. 5·4.
Figure 5·4: Illustration of the majority rule.
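The majority rule can be sketched in a few lines. The function name `majority_label` is hypothetical, and the text does not say how ties are resolved; in this sketch a tie goes to the label that appears first in the sequence, which is how `Counter.most_common` orders equal counts.

```python
from collections import Counter


def majority_label(segment_labels):
    """Return the most frequent label among the segment-level labels.

    Ties are broken in favor of the label encountered first
    (an assumption of this sketch, not stated in the text).
    """
    if not segment_labels:
        raise ValueError("at least one segment label is required")
    return Counter(segment_labels).most_common(1)[0][0]


# Two segments are misclassified, but the sequence label is still correct.
labels = ["walk", "walk", "run", "walk", "jog"]
video_label = majority_label(labels)
```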
5.5 Summary of overall approach
Based on our action recognition framework and practical considerations, the overall
approach can be summarized as follows:
1. Partition videos into overlapping action segments (Fig. 5·5);
2. Represent each action segment using the log-covariance matrix of features (Fig. 5·6);
3. Classify each query segment based on NN or SLA classification algorithms
(Fig. 5·7);
4. Use the majority rule to decide the action label of the query video (Fig. 5·4).
Figure 5·5: Illustration of segment partitioning.
Figure 5·6: Illustration of action representation.
Figure 5·7: Illustration of action classification.
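Steps 2 to 4 of the summary can be sketched end to end as follows. This is a rough illustration under several assumptions, not the implementation used in this work: each segment is given as an array of feature vectors with at least two dimensions, the covariance is regularized by a small `eps` so that its matrix logarithm exists, the NN classifier compares log-covariance matrices with the Frobenius norm, and SLA classification is not shown. All function names and the synthetic data are hypothetical.

```python
from collections import Counter

import numpy as np


def log_covariance(features, eps=1e-6):
    """Matrix logarithm of the covariance of `features` (n_samples x d, d >= 2).

    `eps` times the identity is added so that all eigenvalues are
    strictly positive (an assumption of this sketch).
    """
    cov = np.cov(features, rowvar=False)
    cov = cov + eps * np.eye(cov.shape[0])
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def nn_classify(query_mat, train_mats, train_labels):
    """Label of the training matrix closest to `query_mat` (Frobenius norm)."""
    dists = [np.linalg.norm(query_mat - m) for m in train_mats]
    return train_labels[int(np.argmin(dists))]


def classify_video(segment_features, train_mats, train_labels):
    """Classify every segment, then fuse the labels by the majority rule."""
    labels = [
        nn_classify(log_covariance(f), train_mats, train_labels)
        for f in segment_features
    ]
    return Counter(labels).most_common(1)[0][0]


# Synthetic stand-in data: class "b" features have 5x the spread of class "a".
rng = np.random.default_rng(0)
train_mats = [
    log_covariance(rng.normal(size=(200, 3))),
    log_covariance(5.0 * rng.normal(size=(200, 3))),
]
train_labels = ["a", "b"]
query_segments = [5.0 * rng.normal(size=(200, 3)) for _ in range(4)]
predicted = classify_video(query_segments, train_mats, train_labels)
```

In a real pipeline, `query_segments` would hold the localized features extracted from each overlapping segment produced in step 1, and `train_mats` would contain one log-covariance matrix per training segment.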