2025-11-25
18/144,306
2023-05-08
US 12,483,437 B1
2025-11-25
Tauqir Hussain
Westborough IP Group, LLC
2044-02-28
Smart Summary: The system detects bad behavior while a user is recording a video by using machine learning technology. It compares the user's actions to a list of known bad habits to determine if any unacceptable behavior is present. If the system is confident about the behavior, it marks it in the video and adds it to a list. If it's unsure, it asks the user to confirm whether the behavior occurred. Once confirmed, even uncertain behaviors can be marked and added to the list, helping users become more aware of their actions.
Handling unacceptable behavior by a user making a video recording includes detecting the unacceptable behavior by the user while the user is making the recording by applying machine learning to data about the user received from capturing devices. A predetermined list of bad habits is used and recognition accuracy is used for an episode of the unacceptable behavior. The episode is marked in the video recording and added to a list of bad habit episodes for the video recording if recognition accuracy of the episode is high. The user is prompted for confirmation of the episode if the recognition accuracy of the episode is low. The episode is marked in the video recording and added to the list of bad habit episodes if the recognition accuracy of the episode is low and the user confirms the episode. The machine learning may include an initial training phase.
H04L12/1831 » CPC main
Data switching networks; Details; Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
G06F1/163 » CPC further
Details not covered by groups - and; Constructional details or arrangements for portable computers Wearable computers, e.g. on a belt
G06F3/165 » CPC further
Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements; Sound input; Sound output Management of the audio stream, e.g. setting of volume, audio stream path
G06N20/00 » CPC further
Machine learning
G06V20/46 » CPC further
Scenes; Scene-specific elements in video content Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
G06V40/107 » CPC further
Recognition of biometric, human-related or animal-related patterns in image or video data; Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands Static hand or arm
G06V40/172 » CPC further
Recognition of biometric, human-related or animal-related patterns in image or video data; Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands; Human faces, e.g. facial parts, sketches or expressions Classification, e.g. identification
G06V40/28 » CPC further
Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition Recognition of hand or arm movements, e.g. recognition of deaf sign language
G08B3/10 » CPC further
Audible signalling systems; Audible personal calling systems using electric transmission; using electromagnetic transmission
H04L12/18 IPC
Data switching networks; Details; Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
G06F1/16 IPC
Details not covered by groups - and Constructional details or arrangements
G06F3/16 IPC
Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements Sound input; Sound output
G06V20/40 IPC
Scenes; Scene-specific elements in video content
G06V40/10 IPC
Recognition of biometric, human-related or animal-related patterns in image or video data Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
G06V40/16 IPC
Recognition of biometric, human-related or animal-related patterns in image or video data; Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands Human faces, e.g. facial parts, sketches or expressions
G06V40/20 IPC
Recognition of biometric, human-related or animal-related patterns in image or video data Movements or behaviour, e.g. gesture recognition
This application is a continuation-in-part of U.S. patent application Ser. No. 18/122,749, filed on Mar. 17, 2023, and entitled “RECOGNIZING AND MITIGATING DISPLAYS OF UNACCEPTABLE AND UNHEALTHY BEHAVIOR BY PARTICIPANTS OF ONLINE VIDEO MEETINGS”, which is a continuation of U.S. patent application Ser. No. 17/226,315, filed on Apr. 9, 2021, and entitled “RECOGNIZING AND MITIGATING DISPLAYS OF UNACCEPTABLE AND UNHEALTHY BEHAVIOR BY PARTICIPANTS OF ONLINE VIDEO MEETINGS”, which claims priority to U.S. Prov. App. No. 63/008,769, filed on Apr. 12, 2020, and entitled “TRACKING HAND MOVEMENTS AND GENERATING ALARMS TO PREVENT USERS FROM TOUCHING THEIR FACES”, all of which are incorporated herein by reference.
This application is directed to the field of facial, gesture, hand movement and gaze recognition, and more particularly to the field of recognizing and mitigating displays of unacceptable and unhealthy behavioral habits by authors of immersive video recordings.
Video content has emerged as a dominant productivity, educational and entertainment medium for contemporary businesses and homes. An average Internet user spends about 100 minutes per day watching online video content. 95% of Internet users watch product and service illustrations in the form of explainer videos; consumer polls show that 84% of the watchers made a purchase decision after learning product and service features from such videos. It is estimated that viewers can retain about 95% of the information found in video content compared to just 10% of the information after consuming textual information.
According to market research, the size of the global enterprise video market will grow from $16.4 billion in 2021 to $52.9 billion by 2030 at CAGR 13.8% (11.3% in the US). Over 60% of this market represents asynchronous, pre-recorded content, and almost 40% corresponds to synchronous video conferencing. Experts underscore two major internal factors driving an accelerated growth of the enterprise video market: the need to improve operational efficiency and employee productivity, and the need to connect the remote workforce distributed through physical offices, homes and on the go in many geographical locations, which reflects the emerging distributed and hybrid work paradigm for the global workforce.
Business applications that benefit from the growing use of video content include professional training, education, e-commerce, marketing, product development and support, business communications and presentations, hiring and onboarding, consulting, etc. Market estimates put the video content for marketing and sales in Banking, Financial Services, and Insurance as the largest segment. The fastest overall growth rate is also expected in the subsegment of market and client engagement of the marketing and sales segment.
Notwithstanding the sizable efficiency gains and numerous benefits from the proliferation of video content, there is a growing list of difficulties and challenges accompanying the broadening use of video solutions, such as inadequate equipment, incompatible software applications, corporate firewalls limiting user access, low network bandwidth, insufficient scalability, etc. The above issues are aggravated by the quickly growing viewer audience and diversifying device base, which includes connected mobile devices, such as smartphones and tablets working on multiple platforms, coming in many form factors, and therefore dictating different preferences for the evolving spectrum of geometries of video frames.
The inventory of technical and organizational challenges for the creation and pervasive distribution of video content has been lately augmented with new issues caused by the expanding possibilities and features of the video medium, specifically with the increased use of Augmented Reality (AR) and immersive features in video presentations. Market analysts have long predicted and welcomed advancements in AR and VR, representing significant opportunities for enterprise video service providers. Pioneered by mmhmm, inc., Loom, inc., and other technology players, the immersion of a presenter, captured by a front camera during a pre-recorded or interactive presentation, automatically segmented from the presenter's environment and placed in front of the presentation materials, such as a slide deck, with additional features, allowing the presenter to resize, move, and add visual effects to the presenter's image, has become an increasingly popular feature used in many thousands of asynchronous (pre-recorded) and interactive presentation videos.
An important aspect of the immersive video content is cultural and social acceptability of behavior of a presenter by viewers of the video. The acceptability concept involves multiple aspects, ranging from appearance and attire of the presenter and condition of the physical presentation environment (in case it is preferred to a virtual background) to keeping a positive attitude and body language and avoiding unacceptable behavioral displays and manifestations of bad habits, such as yawning, yelling at family members or pets, cursing, blowing or digging one's nose, turning away from the screen during presentation, etc. Such behavioral flaws may occur subconsciously so that avoiding them, especially during a lengthy video recording, may be challenging. Some of these habits may also be unhealthy or even harmful; for example, when presenters are touching their faces, there is a risk of spreading viruses and germs from fingers of the presenter to mucous membranes to cause further infections with dangerous viruses, such as Covid-19 or flu.
In spite of extensive case studies and media reports and notwithstanding the availability of hardware and software capabilities for capturing and tracking behavior of a presenter (for example, facial and gesture recognition technologies are built into all major software platforms, while eye-tracking is supported by a growing list of mobile devices), there is little practical research and assistance in detecting and mitigating unacceptable, disagreeable and/or unhealthy behavior of presenters and other authors of the pre-recorded immersive video content.
Accordingly, it is desirable to develop techniques and workflows for recognizing and mitigating unacceptable behaviors by creators of the instantly immersive video content.
According to the system described herein, handling unacceptable behavior by a user making a video recording includes detecting the unacceptable behavior by the user while the user is making the recording by applying machine learning to data about the user received from one or more capturing devices and by using a predetermined list of bad habits, determining recognition accuracy for a particular episode of the unacceptable behavior, marking the particular episode in the video recording and adding the particular episode to a list of bad habit episodes for the video recording in response to recognition accuracy of the particular episode being high, prompting the user for confirmation of the particular episode in response to the recognition accuracy of the particular episode being low, and marking the particular episode in the video recording and adding the particular episode to the list of bad habit episodes for the video recording in response to the recognition accuracy of the particular episode being low and the user confirming the particular episode. The machine learning may include an initial training phase that, prior to deployment, is used to obtain a general recognition capability for each item on the predetermined list of bad habits having a low recognition accuracy that has been confirmed by the user. The confirmation may be provided by the user and may be used to improve the recognition accuracy of the machine learning. The one or more capturing devices may include a laptop with a camera and a microphone, a mobile device, autonomous cameras, add-on cameras, headsets, regular speakers, smart watches, wristbands, smart rings, and wearable sensors, smart eyewear, heads-up displays, headbands, and smart footwear. The data about the user may include visual data, sound data, motion, proximity and chemical sensor data, heart rate, breathing rate, and/or blood pressure. 
The list of bad habits may include nail biting, yelling, cursing, making unacceptable gestures, digging one's nose, yawning, blowing one's nose, combing one's hair, slouching, and/or looking away from a screen. Technologies used to detect bad habits may include facial recognition, sound recognition, speech recognition, gesture recognition, and/or hand movement recognition. Yawning may be detected using a combination of the facial recognition technology, the sound recognition technology, and the gesture recognition technology. Face touching may be detected using the facial recognition technology and hand movement recognition technology. Digging one's nose may be detected using the facial recognition technology and hand movement recognition technology. At least some bad habit episodes from the list of bad habit episodes may be presented to the user. The user may delete at least one of the bad habit episodes from the video recording. A portion of the video recording corresponding to the at least one of the bad habit episodes may be re-recorded by the user. A list of identified types of bad behavior may be presented to the user with examples and recommendations. The list of identified types of bad behavior may be used for avoidance of bad behavior for future recordings. The list of identified types of bad behavior may be presented to the user as part of a learning program presented to the user. The learning program may track user success in behavior improvement and repeat learning cycles as needed.
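The handling of an episode according to its recognition accuracy, as summarized above, may be illustrated with a minimal sketch. The sketch is an illustration under stated assumptions and not the claimed implementation: the `Episode` and `Recording` records, the `confirm` callback, and the numeric threshold of 0.8 are hypothetical, since the description distinguishes only between high and low recognition accuracy.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical cut-off between "high" and "low" recognition accuracy;
# the description does not give a numeric value.
HIGH_ACCURACY = 0.8


@dataclass
class Episode:
    habit: str       # item from the predetermined list of bad habits
    start_s: float   # episode start within the recording, in seconds
    end_s: float     # episode end within the recording, in seconds
    accuracy: float  # recognition accuracy reported by the recognizer, 0..1


@dataclass
class Recording:
    marks: List[Episode] = field(default_factory=list)               # episodes marked in the video
    bad_habit_episodes: List[Episode] = field(default_factory=list)  # list of bad habit episodes


def handle_episode(rec: Recording, ep: Episode,
                   confirm: Callable[[Episode], bool]) -> bool:
    """Mark the episode if accuracy is high, or if it is low and the user confirms.

    The user is prompted (confirm is called) only when accuracy is low,
    because `or` short-circuits on the first true operand.
    """
    if ep.accuracy >= HIGH_ACCURACY or confirm(ep):
        rec.marks.append(ep)
        rec.bad_habit_episodes.append(ep)
        return True
    return False
```

In this sketch, a low-accuracy episode that the user does not confirm is simply left unmarked; how such episodes are otherwise treated is not specified here.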
According further to the system described herein, preventing a user from touching a face of the user includes obtaining video frames of the user including the face of the user, applying facial recognition technology to the video frames to detect locations of particular portions of the face, detecting a position, shape, and trajectory of a moving hand of the user in the video frames, predicting a final position and shape of the hand based on the position, shape, and trajectory of the hand and on the locations of the specific portions of the face, and providing an alarm to the user in response to predicting that a final position of the hand will be touching the face of the user. Predicting a final position of the hand may include determining if the hand crosses an alert zone that is proximal to the face. The alarm may vary according to a predicted final shape of the hand and according to predicting that a final position of the hand will be touching a specific one of the particular portions of the face. The predicted final shape of the hand may be an open palm or open fingers.
According further to the system described herein, preventing a user from touching a face of the user includes detecting a position, shape, and trajectory of a moving hand of the user based on one or more sensors from a wearable device of the user, predicting a final position and shape of the hand based on the position, shape, and trajectory of the hand and on the locations of the specific portions of the face, and providing an alarm to the user in response to predicting a final position of the hand will be touching the face of the user. The wearable device may be a smart watch and at least one of the position, shape, and trajectory of the moving hand may be determined using sensors of the smart watch. Predicting the final position and shape of the hand may use machine learning that is adapted to the user during a training phase. The wearable device may be a smart ring and at least one of the position, shape, and trajectory of the moving hand may be determined using a proximity sensor, an accelerometer or a gyroscope of the smart ring. The wearable device may be smart glasses and at least one of the position, shape, and trajectory of the moving hand may be determined using a proximity sensor of the smart glasses.
According further to the system described herein, a non-transitory computer readable medium contains software that handles unacceptable behavior by a user making a video recording. The software includes executable code that detects the unacceptable behavior by the user while the user is making the recording by applying machine learning to data about the user received from one or more capturing devices and by using a predetermined list of bad habits, executable code that determines recognition accuracy for a particular episode of the unacceptable behavior, executable code that marks the particular episode in the video recording and adds the particular episode to a list of bad habit episodes for the video recording in response to recognition accuracy of the particular episode being high, executable code that prompts the user for confirmation of the particular episode in response to the recognition accuracy of the particular episode being low, and executable code that marks the particular episode in the video recording and adds the particular episode to the list of bad habit episodes for the video recording in response to the recognition accuracy of the particular episode being low and the user confirming the particular episode. The machine learning may include an initial training phase that, prior to deployment, is used to obtain a general recognition capability for each item on the predetermined list of bad habits having a low recognition accuracy that has been confirmed by the user. The confirmation may be provided by the user and may be used to improve the recognition accuracy of the machine learning. The one or more capturing devices may include a laptop with a camera and a microphone, a mobile device, autonomous cameras, add-on cameras, headsets, regular speakers, smart watches, wristbands, smart rings, and wearable sensors, smart eyewear, heads-up displays, headbands, and smart footwear. 
The data about the user may include visual data, sound data, motion, proximity and chemical sensor data, heart rate, breathing rate, and/or blood pressure. The list of bad habits may include nail biting, yelling, cursing, making unacceptable gestures, digging one's nose, yawning, blowing one's nose, combing one's hair, slouching, and/or looking away from a screen. Technologies used to detect bad habits may include facial recognition, sound recognition, speech recognition, gesture recognition, and/or hand movement recognition. Yawning may be detected using a combination of the facial recognition technology, the sound recognition technology, and the gesture recognition technology. Face touching may be detected using the facial recognition technology and hand movement recognition technology. Digging one's nose may be detected using the facial recognition technology and hand movement recognition technology. At least some bad habit episodes from the list of bad habit episodes may be presented to the user. The user may delete at least one of the bad habit episodes from the video recording. A portion of the video recording corresponding to the at least one of the bad habit episodes may be re-recorded by the user. A list of identified types of bad behavior may be presented to the user with examples and recommendations. The list of identified types of bad behavior may be used for avoidance of bad behavior for future recordings. The list of identified types of bad behavior may be presented to the user as part of a learning program presented to the user. The learning program may track user success in behavior improvement and repeat learning cycles as needed.
The proposed system offers a mechanism for recognizing and mitigating displays of unacceptable behavior by a user who is a presenter of an immersive video presentation recorded by the user; the system prevents the user from touching the face of the user by identifying a pool of capturing devices and techniques capable of continuous monitoring of the user; assembling a recognition and tracking technology stack for detecting manifestations of unacceptable behaviors or attempts by the user to touch the face of the user; continuously training recognition and tracking technologies to improve recognition speed and accuracy; warning and notifying the user about manifestations and attempts of unacceptable and unhealthy behavior, and creating a list of unacceptable behaviors by the user; offering the user a learning course with recommendations on avoiding unacceptable behavior, and presenting the user with episodes of unacceptable behavior at the end of each recorded presentation, allowing the user to edit the presentation and eliminate the episodes of unacceptable behavior.
System functioning is explained in more detail below as follows:
The system may combine several technologies to recognize and track unacceptable behaviors. For example, an act of yawning may require a combination of facial recognition technology, gesture recognition technology (a mouth covering gesture representative of a portion of users) and a sound recognition technology (a characteristic yawning sound).
Table 1 exemplifies combinations of technologies used for recognition and tracking of certain acts of bad habits.
TABLE 1. Examples of combined technologies for recognizing displays of bad habits

| Bad habit | Technology 1: Name | Technology 1: Application | Technology 2: Name | Technology 2: Application | Technology 3: Name | Technology 3: Application |
| --- | --- | --- | --- | --- | --- | --- |
| Yawning | Facial recognition | Opening mouth | Sound recognition | Yawning sound | Gesture recognition | Covering mouth |
| Touching face | Facial recognition | Adding hotspots (nose, mouth, eyes) | Hand movement recognition | Estimating final finger position | | |
| Digging nose | Facial recognition | Adding nose hotspot | Hand movement recognition | Estimating final finger position | | |
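Table 1 may be read as a mapping from each bad habit to the technologies whose signals are combined. The following is a minimal sketch, assuming each technology reports whether its signal fired for a given time window. The dictionary layout, the `OPTIONAL` set, and the function name are illustrative assumptions; treating the mouth covering gesture as optional follows the remark that the gesture is representative of only a portion of users.

```python
from typing import Dict, Set

# Rows of Table 1: habit -> {technology: what the technology is applied to}.
TABLE_1: Dict[str, Dict[str, str]] = {
    "yawning": {
        "facial recognition": "opening mouth",
        "sound recognition": "yawning sound",
        "gesture recognition": "covering mouth",
    },
    "touching face": {
        "facial recognition": "adding hotspots (nose, mouth, eyes)",
        "hand movement recognition": "estimating final finger position",
    },
    "digging nose": {
        "facial recognition": "adding nose hotspot",
        "hand movement recognition": "estimating final finger position",
    },
}

# Assumption: a signal that only some users produce is not required for detection.
OPTIONAL: Dict[str, Set[str]] = {"yawning": {"gesture recognition"}}


def habit_detected(habit: str, fired: Set[str]) -> bool:
    """Return True if every non-optional technology listed for the habit has fired."""
    required = set(TABLE_1[habit]) - OPTIONAL.get(habit, set())
    return required <= fired
```

For example, `habit_detected("yawning", {"facial recognition", "sound recognition"})` holds even though no mouth covering gesture was observed, while facial recognition alone is not sufficient for any of the three habits.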
It should be noted that in each of the mobile scenarios B.a.-B.c., the system software may run on a smartphone of the user, communicating with sensors on wearable devices via Bluetooth, Wi-Fi, or other wireless connection. The smartphone may also be used to sound or display alarms.
Embodiments of the system described herein will now be explained in more detail in accordance with the figures of the drawings, which are briefly described as follows.
FIGS. 1A-1D are schematic illustrations of the system architecture and components, according to an embodiment of the system described herein.
FIGS. 2A-2B are schematic illustrations of detecting and mitigating episodes of unacceptable behavior and collecting training and user learning information during an immersive video recording, according to an embodiment of the system described herein.
FIGS. 3A-3D are schematic illustrations of various face touch outcomes captured by a front-facing camera of a notebook, according to an embodiment of the system described herein.
FIGS. 4A-4D are schematic illustrations of dynamic facial and hand movement recognition, of trajectory and risk assessment, and of user alarm options, according to an embodiment of the system described herein.
FIG. 5 is a schematic illustration of a face touch identification technology for mobile users with a smart watch, according to an embodiment of the system described herein.
FIGS. 6A-6B are schematic illustrations of face touch alarms for mobile users with smart eyeglasses and a smart ring featuring proximity sensors, according to an embodiment of the system described herein.
FIG. 7 is a system flow diagram illustrating system functioning in connection with detection and mitigation of unacceptable user behavior episodes and collecting training and user learning information during an immersive video recording, according to an embodiment of the system described herein.
FIG. 8 is a system flow diagram illustrating system functioning in connection with preventing desktop and notebook users from touching their faces, according to an embodiment of the system described herein.
The system described herein offers a variety of mechanisms for recognizing and mitigating episodes of unacceptable behavior by a user recording an immersive video presentation and for preventing users from touching their faces.
FIGS. 1A-1D are schematic illustrations of the system architecture and components.
FIG. 1A shows system architecture and includes five components of the system: a pool of capturing devices 110, a list of bad habits (unacceptable behaviors) 120, a technology stack 130, a feedback, mitigation, and user learning component 140, and an analytics and machine learning component 150. Each component is exemplified with several characteristic items (periods at the bottom of components, like periods 105 of the capturing devices component 110, indicate that more items are available for the component in FIGS. 1B-1D):
Displays of unacceptable behaviors from the list 120 are captured by devices from the capturing devices 110, as shown by an arrow 104, recognized and tracked by technologies from the stack 130, as shown by an arrow 105, and are forwarded to the feedback, mitigation, and the user learning component 140, as shown by an arrow 106, which may take various actions as follows:
Data on recognition of episodes of unacceptable behavior, potentially including raw recorded video, audio, sensor and other data, user feedback, system and learning actions may be transferred from the user learning component 140 to the analytics and machine learning component 150, as shown by an arrow 107. Subsequent data processing, using portions of raw data as training materials and incremental post-recording machine learning, as well as user post-processing of episodes of unacceptable behavior are explained elsewhere herein.
FIG. 1B shows an extended list 110′ of capturing devices. In addition to the previous listing of the laptop 112 and the smart speaker 113, the extended list 110′ may include mobile devices 114 (smartphones, tablets, etc.), autonomous and add-on cameras 115, headsets and regular speakers 116, smart watches and wristbands 117, smart rings 118 and wearable (as well as standalone) sensors 119. The extended list 110′ may also include other types of wearable devices, such as smart eyewear, heads-up displays, headbands, smart clothing, footwear, etc.
FIG. 1C shows an extended list 120′ of unacceptable behavior, adding to the bad habits 121-124 that were previously listed in FIG. 1A the following five habits: yawning 125, blowing one's nose 126, combing one's hair in public 127, slouching 128 and looking away from the screen (during an immersive video recording) 129. There are many more examples of unacceptable behaviors outside the extended list 120′.
FIG. 1D provides an extended list 130′ of the technology stack 130 beyond the four technologies 131-134 exemplified in FIG. 1A; the list 130′ adds hand movement recognition 135, speech recognition 136 (different from the sound recognition technology 132), general image recognition 137 (different from the facial recognition technology 131), and sentiment recognition 138.
FIGS. 2A-2B are schematic illustrations of detecting and mitigating episodes of unacceptable behavior and collecting training and user learning information during an immersive video recording.
In FIG. 2A, a user 210 is recording an immersive video presentation with presentation materials (slides) 220. The user 210 may change the appearance (such as position and size) of an image 215 of the user 210. In FIG. 2A, the immersive video presentation is being recorded at an early stage of user recording history when the recognition accuracy has not reached high levels.
During the recording, four technologies from the extended technology list 130′ (the sound recognition technology 132, the facial recognition technology 131, the hand movement recognition 135 and the general image recognition 137) have jointly detected an episode 230 of bad behavior associated with blowing one's nose 126 when the user 210 is loudly blowing their nose. Specifically, the sound recognition technology 132 may detect a sound specific to blowing one's nose; the facial recognition technology 131 may recognize a characteristic facial expression for an act of blowing one's nose (which may differ from person to person and may require an adaptation of the facial recognition technology to specific facial expression patterns of the user 210); the hand movement recognition 135 and the general image recognition 137 may detect bringing a handkerchief to the nose of the user 210. Note that the slide 225 has changed compared with the initial slide 220 and a position and size of an image 217 of the user 210 have also changed to a different position and size of an image 219 of the user 210 within the episode 230.
Because the system is in an early stage of user recording history, recognition accuracy for blowing one's nose 126 is not necessarily high, as indicated by question marks within the frames of the episode 230. Therefore, the feedback, mitigation, and the user learning component 140 checks the validity of recognition with the user 210 by sounding or displaying the alarm 144 and displaying the notification 141, requesting confirmation of blowing one's nose 126. The user 210 confirms the correct recognition, as indicated by a confirmation sign 142′, and the episode 230 is placed into the confirmed material area of the training database 143; the information may further be used for machine learning 152 or at the user learning stage 146, where the episode of unacceptable behavior of blowing one's nose 126 may augment the list 147 of all unacceptable behaviors of the user 210 presented to the user during learning, as explained elsewhere herein.
In FIG. 2B, a user 240 is recording an immersive video presentation with presentation materials 250. The immersive video presentation is being recorded at a mature stage of user recording history when the recognition accuracy is high and identification of unacceptable behaviors is confident. Through the course of the recording, two technologies, the sound recognition technology 132 and the facial recognition technology 131, jointly detect a bad behavior episode 260 with the yawning 125 (the gesture recognition technology shown in Table 1 in conjunction with yawning detection was not used in this example, because the user 240 did not cover the mouth of the user 240 with the hand of the user 240). Analogously to FIG. 2A, the presentation environment changes through the episode 260, as shown by two items 245, 255.
At the end of the recording, the feedback, mitigation, and the user learning component 140 delivers the obtained information on the unacceptable behavior to different destinations, as explained in Section 6 (b), i-ii of the Summary:
FIGS. 3A-3D are schematic illustrations of various face touch outcomes captured by the front-facing camera 112a of the notebook 112 (see also FIG. 1A).
In FIG. 3A, an outcome 310a is favorable: hand(s) of a user 320a is (are) not detected in a video frame 330a.
In FIG. 3B, an outcome 310b is, generally speaking, unfavorable: a frame 330b shows a user 320b touching a cheek 340b with a fist 350b. However, the touch illustrated in FIG. 3B poses less risk of transmitting germs or viruses, so the outcome 310b is marked as a moderate outcome.
Both outcomes 310c, 310d in FIGS. 3C, 3D appear highly unfavorable and are marked as such: in FIG. 3C, a video frame 330c shows a user 320c touching an eye 360c with a finger 370c, while in FIG. 3D a video frame 330d displays a user 320d putting fingers 380d into a mouth 390d of the user 320d.
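The three outcome grades of FIGS. 3A-3D can be illustrated with a minimal sketch. The names `Outcome`, `HIGH_RISK_ZONES`, and `classify_touch` are hypothetical and do not appear in the specification; treating the nose and ears as high-risk alongside the eye and mouth is an assumption borrowed from the touch risk zones listed for FIG. 4A.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FAVORABLE = "favorable"                    # FIG. 3A: no hand detected
    MODERATE = "moderate"                      # FIG. 3B: touch outside the risk zones
    HIGHLY_UNFAVORABLE = "highly unfavorable"  # FIGS. 3C, 3D: eye or mouth touched


# Assumption: every touch risk zone is graded like the eye and mouth examples.
HIGH_RISK_ZONES = {"eye", "mouth", "nose", "ear"}


def classify_touch(touched_zone: Optional[str]) -> Outcome:
    """Grade a frame by the facial zone the hand touches (None = no touch)."""
    if touched_zone is None:
        return Outcome.FAVORABLE
    if touched_zone in HIGH_RISK_ZONES:
        return Outcome.HIGHLY_UNFAVORABLE
    return Outcome.MODERATE
```

For example, `classify_touch("cheek")` yields the moderate grade, matching the fist-on-cheek case of FIG. 3B.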
FIGS. 4A-4D are schematic illustrations of dynamic facial and hand movement recognition, of trajectory and risk assessment, and of user alarm options.
FIG. 4A shows a general layout and usage of facial recognition. The camera 112a of the notebook 112 captures a series of video frames 330a; for each frame, the facial recognition technology 131 identifies a face 320a and identifies positions of several hotspots, or touch risk zones 420 on the face 320a, typically corresponding to mouth, nose, eyes, and ears.
FIG. 4B illustrates functioning of dynamic facial and hand movement recognition in a secure situation with no alarms. Usage of the facial recognition technology 131 is explained in conjunction with FIG. 4A. Features of the hand movement recognition technology 135 (see FIG. 1 for item enumeration) may include recognizing a dynamically changing shape of a moving hand (fist, palm, fingers), a predominant direction and speed of the moving hand, estimating a trajectory of the moving hand, etc. The system may reserve an alert zone (also called a capture zone) 440 within a video frame 430; the alert zone 440 is proximal to a facial image 320a′ of a user. When an image of a hand of the user is detected in the capture zone 440, the hand movement recognition technology 135 estimates a current direction 450a of the hand and estimates a state and palm orientation 460a of the hand. Note that FIG. 4B illustrates a moving fist of the user. The hand movement recognition technology 135 also estimates a trajectory of a central axis 470a of the moving hand, which in FIG. 4B does not cross an alert zone and accordingly does not cause an alarm, so a current user status 480a is considered safe.
FIG. 4C illustrates functioning of dynamic facial and hand movement recognition in a situation with a minor face touching alarm. The capture zone 440 plays the same role as in FIG. 4B; the facial recognition technology 131 and the hand movement recognition technology 135 also function similarly but are applied to a different situation. Once a hand of a user 320b′ is detected in the capture zone 440, the hand movement recognition technology 135 estimates a direction 450b of the hand and then estimates a shape and then a palm orientation of the hand. A direction and speed 460b are estimated, which causes a prediction that an open palm is moving toward a face of the user 320b′ and will soon cross the alert zone 440 and touch the face. The prediction activates the mitigation system, which sounds or displays a minor alert 490a, reflecting an assessment by the mitigation system that a final position of the palm touching the face of the user is relatively safe from the hygienic standpoint. As the palm touches the face, the alert may be modified to a different tone 495a, staying a minor alert, and corresponding to an assigned user status 480b.
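One way to picture the trajectory test in FIG. 4B is a straight-line extrapolation of the hand's central axis checked against a rectangular alert zone. The sketch below is an assumption-laden illustration, not the recognition technology itself: the rectangle model, the linear extrapolation, the `horizon` parameter, and the names `Rect` and `trajectory_crosses_zone` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned alert zone in frame coordinates (hypothetical model)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def trajectory_crosses_zone(px: float, py: float, vx: float, vy: float,
                            zone: Rect, horizon: float = 1.0) -> bool:
    """Return True if the point (px, py) moving with velocity (vx, vy)
    enters `zone` within `horizon` time units (slab intersection test)."""
    t_enter, t_exit = 0.0, horizon
    for p, v, lo, hi in ((px, vx, zone.x_min, zone.x_max),
                         (py, vy, zone.y_min, zone.y_max)):
        if v == 0.0:
            # No motion along this axis: the point must already lie in the slab.
            if p < lo or p > hi:
                return False
            continue
        t1, t2 = (lo - p) / v, (hi - p) / v
        if t1 > t2:
            t1, t2 = t2, t1
        t_enter = max(t_enter, t1)
        t_exit = min(t_exit, t2)
        if t_enter > t_exit:
            return False
    return True
```

A hand moving away from the zone, or parallel to it, returns `False`, which corresponds to the safe status 480a; a hand heading into the zone within the horizon returns `True`.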
FIG. 4D illustrates functioning of dynamic facial and hand movement recognition in a situation with a major face touching alarm. The roles of the capture zone 440 and of the facial recognition technology 131 and the hand movement recognition technology 135 are the same as in FIGS. 4B-4C. As previously, once a hand of a user 320c′ is captured in a video frame, the hand movement recognition technology 135 assesses a direction 450c of the hand; shortly after, the user moves open fingers in a direction of the face of the user 320c′, as shown by a pictogram 460c; the configuration shown in FIG. 4D is considered potentially the most harmful; based on an assessment 460c and an extrapolation of hand trajectory, the mitigation component immediately activates a low-volume alarm 490a. A next assessment 470c provided by the hand movement recognition system 135 confirms and further enhances the previous assessment, because a finger of the user 320c′ is well within an alert zone and approaches a hotspot (a touch risk zone provided by the facial recognition technology 131, as explained in conjunction with FIG. 4A) associated with an eye of the user 320c′. In response to the latter assessment, the alarm is elevated and upgraded to a state 495a. Subsequently, the user touches an eye with a finger, a status 480c is determined as dangerous and the alarm is set at a maximum level 495c.
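The staged escalation in FIG. 4D (low-volume alarm, elevated alarm, maximum alarm) can be sketched as a mapping from successive assessments to alarm levels. The four-level scale and the names `AlarmLevel` and `level_for` are hypothetical; the specification describes the stages but does not prescribe a numeric scale.

```python
from enum import IntEnum


class AlarmLevel(IntEnum):
    NONE = 0
    LOW = 1        # hand extrapolated toward the face
    ELEVATED = 2   # finger inside the alert zone, approaching a hotspot
    MAXIMUM = 3    # hotspot (for example, an eye) actually touched


def level_for(heading_to_face: bool, near_hotspot: bool,
              touching_hotspot: bool) -> AlarmLevel:
    """Map one assessment of the hand to an alarm level, most severe first."""
    if touching_hotspot:
        return AlarmLevel.MAXIMUM
    if near_hotspot:
        return AlarmLevel.ELEVATED
    if heading_to_face:
        return AlarmLevel.LOW
    return AlarmLevel.NONE


# The three assessments of FIG. 4D, in order of occurrence.
fig_4d_trace = [
    level_for(True, False, False),
    level_for(True, True, False),
    level_for(True, True, True),
]
```

Because `AlarmLevel` is an `IntEnum`, levels compare numerically, so a caller can detect an increase with a plain `>` comparison.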
FIG. 5 is a schematic illustration 500 of a face touch identification technology for mobile users with a smart watch. Technology adaptation to a particular user may include a training (machine learning) phase 510t for a user 520, an owner of a smart watch (or a smart wristband) 530 attached to a hand 540 of the user 520. An accelerometer or other movement sensor (or multiple sensors) 535 may capture trajectories and parameters (speed, rotation) of user hand movements 550a, 550b that may occasionally cause undesired events 560 corresponding to the user touching their face. Fragments of such trajectories may be used as a training material for a machine learning component 570, which may eventually produce a reliable classifier 580 for a usage phase 510u whereby fragments of trajectories 550c, obtained through measurements by sensor(s) 535′ of movements of a hand 540′ of an active user with an attached smart watch 530′, are converted into feature vectors (feature construction is established at the training phase) and sent to the classifier 580 used for predicting whether a particular fragment of the hand trajectory poses an increased risk of the user 520 touching their face with the moving hand and may activate alarms 590a, 590b warning the user 520 about an undesirable situation. An increase in alarm volume/urgency for the alarm 590b may be associated with classification progress when the classifier 580 receives feature vectors constructed for larger fragments of the movement trajectory as the moving hand approaches the face of the user 520.
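The training and usage phases of FIG. 5 can be illustrated with a toy pipeline: trajectory fragments are converted into feature vectors, and a classifier learned from labeled fragments predicts whether a new fragment looks like a face touch. Everything below is a stand-in chosen for brevity: the three features, the nearest-centroid rule, and the names `features` and `CentroidClassifier` are assumptions, not the feature construction or classifier 580 of the specification.

```python
import math
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, float, float]  # one accelerometer reading (x, y, z)


def features(fragment: Sequence[Sample]) -> List[float]:
    """Hypothetical features: mean magnitude, peak magnitude, mean z."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in fragment]
    mean_z = sum(z for _, _, z in fragment) / len(fragment)
    return [sum(mags) / len(mags), max(mags), mean_z]


class CentroidClassifier:
    """Nearest-centroid classifier, a deliberately simple stand-in."""

    def fit(self, vectors: Sequence[List[float]],
            labels: Sequence[bool]) -> "CentroidClassifier":
        sums: Dict[bool, List[float]] = {}
        counts: Dict[bool, int] = {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {label: [s / counts[label] for s in acc]
                          for label, acc in sums.items()}
        return self

    def predict(self, vec: List[float]) -> bool:
        """Return the label whose centroid is closest to `vec`."""
        return min(self.centroids,
                   key=lambda label: math.dist(vec, self.centroids[label]))


# Training phase: synthetic fragments, labeled True when a face touch followed.
train = [([(0.0, 0.0, 1.0)] * 5, True), ([(1.0, 0.0, 0.0)] * 5, False)]
clf = CentroidClassifier().fit([features(f) for f, _ in train],
                               [label for _, label in train])

# Usage phase: classify a new fragment.
risky = clf.predict(features([(0.0, 0.0, 0.9)] * 4))
```

To mimic the growing alarm urgency described for the alarm 590b, a caller could re-run `predict` on progressively longer prefixes of the same movement.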
FIGS. 6A-6B are schematic illustrations of face touch alarms for mobile users with smart eyeglasses and a smart ring featuring proximity sensors.
In FIG. 6A, a user 610 wears a pair of smart eyeglasses 620 supplied with a proximity sensor; an alarm 650 is automatically activated every time a passive hand 630 reaches a position 635 where the hand 630 is detected by the proximity sensor, as shown by an item 640.
In FIG. 6B, a user 610′ wears a smart ring 660 on a finger of an active hand 630′; the ring 660 has a built-in proximity sensor. The system automatically enables an alarm 680 every time the active hand 630′ reaches a position 635′ where the proximity sensor of the ring 660 detects the face of the user 610′, as shown by an item 670.
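The proximity-triggered alarms of FIGS. 6A-6B amount to detecting when a sensor reading first falls within range. The sketch below assumes the sensor reports a distance in centimeters and that one alarm is raised per approach rather than per reading; the threshold value and the name `alarm_onsets` are hypothetical.

```python
from typing import Iterable, List


def alarm_onsets(distances_cm: Iterable[float],
                 threshold_cm: float = 5.0) -> List[int]:
    """Return the indices of readings at which an alarm would start,
    that is, where the distance first drops to the threshold or below."""
    onsets: List[int] = []
    in_range = False
    for i, distance in enumerate(distances_cm):
        detected = distance <= threshold_cm
        if detected and not in_range:
            onsets.append(i)  # new approach: raise the alarm once
        in_range = detected
    return onsets
```

With readings `[20, 10, 4, 3, 12, 4]` and the default threshold, the hand approaches twice, so two alarm onsets are reported.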
Referring to FIG. 7, a system flow diagram 700 illustrates system functioning in connection with detection and mitigation of unacceptable user behavior episodes and collecting training and user learning information during an immersive video recording. Processing begins at a step 710, where a user pool of capturing devices is identified, as explained elsewhere herein (see, for example, FIGS. 1A, 1B and the accompanying text). After the step 710, processing proceeds to a step 712, where the system assembles a recognition and tracking technology stack corresponding to the pool of capturing devices (see FIGS. 1A, 1D and the accompanying text for details). After the step 712, processing proceeds to a step 715, where the system collects generic training material on bad habits (unacceptable behaviors) of user(s) utilizing captured data for multiple users and under different conditions. After the step 715, processing proceeds to a step 720, an initial machine learning phase, where the system trains recognition and tracking technologies, adapting the technologies to user specifics, as described elsewhere herein.
After the step 720, processing proceeds to a step 722, where an immersive video recording begins. After the step 722, processing proceeds to a step 725, where the system captures and tracks user behavior. After the step 725, processing proceeds to a test step 730, where it is determined whether a bad habit occurrence has been detected. If not, processing proceeds to a test step 732, where it is determined whether an end of recording has been reached. If not, processing proceeds to the step 725, which may be independently reached from the step 722. Otherwise, if an end of recording has been reached, processing proceeds to a step 780, where a user analytics summary for the recording is compiled. After the step 780, processing proceeds to a step 782, where a list of bad habits, the analytics summary, and the bad habit episodes are delivered to the user. After the step 782, processing proceeds to a step 785, where the user edits the video recording and deletes bad habit episodes, as explained elsewhere herein (see, for example, FIG. 2B and the accompanying text). After the step 785, processing proceeds to a step 790, where the system assesses user learning efficiency. After the step 790, processing proceeds to a test step 792, where it is determined whether user learning efficiency is sufficient. If so, processing is complete; otherwise, processing proceeds to a step 795, where user learning is continued. After the step 795, processing is complete.
If it is determined at the test step 730 that the bad habit occurrence is detected, processing proceeds to a test step 735, where it is determined whether the system is at an early bad habit recognition phase for the current user. If so, processing proceeds to a step 740, where the system sends the user a notification with a confirmation request, as explained elsewhere herein. After the step 740, processing proceeds to a test step 742, where it is determined whether the user returned a positive confirmation. If so, processing proceeds to a step 745, where a corresponding bad habit episode is added to the training data for machine learning. After the step 745, processing proceeds to a step 750, where the bad habit is added to the user list for the user learning phase, as explained elsewhere herein (see FIG. 2A and the accompanying text). After the step 750, processing proceeds to a step 752, where the bad habit episode is marked in the video recording for subsequent processing and editing. After the step 752, processing proceeds to a step 755, where the user analytics database is updated. (Note that processing proceeds to the same step 755 from the test step 742 if it is determined that the user did not return a positive confirmation.) After the step 755, processing proceeds to a test step 760, where it is determined whether there is enough new training data. If so, processing proceeds to a step 762, where incremental machine learning based on the new training data is performed. After the step 762, processing proceeds to a step 765, where the user analytics database is updated. After the step 765, processing proceeds to the step 725, which may be independently reached from the step 722 and the test step 732. (Note that processing proceeds to the same step 725 if it is determined at the test step 760 that there is not enough new training data.)
If it is determined at the test step 735 that the system is not at an early bad habit recognition phase for the current user, processing proceeds to a step 772, where the bad habit episode is marked in the video recording and added to a list of other bad habit episodes for the current video recording (if any). After the step 772, processing proceeds to a test step 775, where it is determined whether the latest detected bad habit is on the user list. If not, processing proceeds to a step 777, where the bad habit is added to the user list for user learning purposes (see FIG. 2B and the accompanying text). After the step 777, processing proceeds to the step 725, which may be independently reached from the steps 722, 760 and the test step 732. (Note that processing proceeds to the same step 725 if it is determined at the test step 775 that the current bad habit is already on the user list.)
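The branch taken after a bad habit is detected (test step 735 onward) can be summarized in a short sketch. The data structures `Episode` and `RecordingState`, the `confirm` callback, and the function `handle_episode` are hypothetical names introduced for illustration; updates to the analytics database and incremental learning (steps 755 through 765) are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Episode:
    habit: str
    start_s: float
    end_s: float


@dataclass
class RecordingState:
    marked: List[Episode] = field(default_factory=list)         # marked in the video
    training_data: List[Episode] = field(default_factory=list)  # confirmed material
    user_habits: Set[str] = field(default_factory=set)          # user learning list


def handle_episode(episode: Episode, early_phase: bool,
                   confirm: Callable[[Episode], bool],
                   state: RecordingState) -> bool:
    """Process one detected episode; return True if it was marked."""
    if early_phase:
        # Steps 740-742: ask the user to confirm a low-confidence detection.
        if not confirm(episode):
            return False
        state.training_data.append(episode)  # step 745
    # Steps 750 / 775-777: a set makes "add if not already listed" implicit.
    state.user_habits.add(episode.habit)
    # Steps 752 / 772: mark the episode for later review and editing.
    state.marked.append(episode)
    return True
```

In the mature phase the `confirm` callback is never invoked, mirroring the direct path from the test step 735 to the step 772.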
Referring to FIG. 8, a system flow diagram 800 illustrates system functioning in connection with preventing desktop and notebook users from touching their faces. Processing begins at a step 810, where the system initiates or maintains (if previously initiated) face recognition technology and software, discussed above. After the step 810, processing proceeds to a step 815, where the face recognition software captures a user face from a video stream and dynamically detects touching of risk zones (mouth, nose, eyes, ears). After the step 815, processing proceeds to a step 820, where the system initiates or maintains hand movement recognition technology and software, including dynamic recognition of the hand state (hand, palm, fingers, fist, etc.). After the step 820, processing proceeds to a step 825, where the system continuously detects, scans and processes capture zones in subsequent video frames (for more information, see FIGS. 4B-4D and the accompanying text). After the step 825, processing proceeds to a test step 830, where it is determined whether a user hand is identified in the current capture zone. If so, processing proceeds to a step 835, where the hand movement recognition software detects the hand status (open, closed, pointing, etc.), palm orientation, direction, speed, and distance from the face.
After the step 835, processing proceeds to a step 840, where the hand trajectory is extrapolated by the system based on the previously processed hand movement. After the step 840, processing proceeds to a step 845, where the system assesses a dynamic risk level of unacceptable face touching based on the extrapolated hand trajectory. After the step 845, processing proceeds to a test step 850, where it is determined whether the risk level is alarming. If so, processing proceeds to a test step 855, where it is determined whether the captured image of the user hand has crossed boundaries of an alert zone (see FIGS. 4B-4D for explanations). If so, processing proceeds to a test step 860, where it is determined whether the alarm has already been initialized for the current episode of hand movement. If so, processing proceeds to a test step 865, where it is determined whether the risk level has increased compared with the risk level for the current alarm intensity. If so, processing proceeds to a step 872, where the alarm intensity is increased to correspond to the risk level. After the step 872, processing proceeds to a step 875, where the system plays the alarm. After the step 875, processing proceeds to a test step 880, where it is determined whether a hand of the user has touched the face of the user. If not, processing proceeds to a test step 882, where it is determined whether the observation period is over. If so, processing is complete; otherwise, processing proceeds to the step 825, which may be independently reached from the step 820.
If it is determined at the test step 880 that the user hand has touched the face, processing proceeds to a test step 885, where it is determined whether the touching is acceptable (see FIGS. 4C-4D and the accompanying text for more information). If so, processing proceeds to a step 890, where the system plays a warning signal (instead of initializing an alarm). After the step 890, processing is complete. If it is determined at the test step 885 that the touching is not acceptable, processing proceeds to a step 895, where the system continuously plays the maximum intensity alarm. After the step 895, processing is complete.
If it is determined at the test step 865 that the risk level has not increased compared to a level reflected in the then current alarm intensity, processing proceeds to the step 875, which may be independently reached from the step 872. If it is determined at the test step 860 that the alarm has not been initialized for the current episode, processing proceeds to a step 870, where the alarm is initialized. After the step 870, processing proceeds to the step 875, which may be independently reached from the test step 865 and the step 872. If it is determined at the test step 855 that the hand image does not cross the boundaries of the alert zone, processing proceeds to the step 825, which may be independently reached from the step 820 and the test step 882. If it is determined at the test step 850 that the assessed risk level of unacceptable face touching is not alarming, processing proceeds to the step 825, which may be independently reached from the step 820 and the test steps 855, 882. If it is determined at the test step 830 that a user hand has not been identified in the capture zone, processing proceeds to the test step 882, which may be independently reached from the test step 880.
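The decision chain of FIG. 8 for a single video frame (test steps 830 through 885) can be condensed into one function. This is a sketch under stated assumptions: the integer risk scale, the choice to initialize the alarm at the current risk level, the returned action strings, and the names `FrameObservation`, `AlarmState`, and `process_frame` are all hypothetical. The end-of-observation check (test step 882) is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameObservation:
    hand_in_capture_zone: bool
    risk_level: int            # 0 = not alarming; larger = riskier (assumed scale)
    crossed_alert_zone: bool
    touched_face: bool
    touch_acceptable: bool


@dataclass
class AlarmState:
    intensity: Optional[int] = None  # None until initialized for this episode


def process_frame(obs: FrameObservation, alarm: AlarmState) -> str:
    """Return the action for one frame: 'continue', 'alarm',
    'warning', or 'max_alarm'."""
    if not obs.hand_in_capture_zone:          # test step 830
        return "continue"
    if obs.risk_level <= 0:                   # test step 850
        return "continue"
    if not obs.crossed_alert_zone:            # test step 855
        return "continue"
    if alarm.intensity is None:               # test step 860 -> step 870
        alarm.intensity = obs.risk_level
    elif obs.risk_level > alarm.intensity:    # test step 865 -> step 872
        alarm.intensity = obs.risk_level
    # Step 875: the alarm plays at alarm.intensity.
    if not obs.touched_face:                  # test step 880
        return "alarm"
    # Test step 885: acceptable touch -> step 890, otherwise step 895.
    return "warning" if obs.touch_acceptable else "max_alarm"
```

Note that the alarm intensity never decreases within an episode: a lower risk reading leaves the current intensity in place, as in the path from the test step 865 directly to the step 875.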
Various embodiments discussed herein may be combined with each other in appropriate combinations in connection with the system described herein. Additionally, in some instances, the order of steps in the flowcharts, flow diagrams and/or described flow processing may be modified, where appropriate. Subsequently, system configurations and functioning may vary from the illustrations presented herein. Further, various aspects of the system described herein may be deployed on various devices, including, but not limited to, notebooks, smartphones, tablets, and other mobile computers and on wearable devices. Smartphones and tablets may use operating system(s) selected from the group consisting of: iOS, Android OS, Windows Phone OS, BlackBerry OS, and mobile versions of Linux OS. Notebooks and tablets may use an operating system selected from the group consisting of: Mac OS, Windows OS, Linux OS, and Chrome OS.
Software implementations of the system described herein may include executable code that is stored in a computer readable medium and executed by one or more processors. The computer readable medium may be non-transitory and include a computer hard drive, ROM, RAM, flash memory, portable computer storage media such as a CD-ROM, a DVD-ROM, a flash drive, an SD card and/or other drive with, for example, a universal serial bus (USB) interface, and/or any other appropriate tangible or non-transitory computer readable medium or computer memory on which executable code may be stored and executed by a processor. The software may be bundled (pre-loaded), installed from an app store or downloaded from a location of a network operator. The system described herein may be used in connection with any appropriate operating system.
Other embodiments of the invention will be apparent to those skilled in the art from a consideration of the specification or practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
1. A method of handling unacceptable behavior by a user making a video recording, comprising:
detecting the unacceptable behavior by the user while the user is making the recording by applying machine learning to data about the user received from one or more capturing devices and by using a predetermined list of bad habits;
determining recognition accuracy for a particular episode of the unacceptable behavior;
marking the particular episode in the video recording and adding the particular episode to a list of bad habit episodes for the video recording in response to recognition accuracy of the particular episode being high;
prompting the user for confirmation of the particular episode in response to the recognition accuracy of the particular episode being low; and
marking the particular episode in the video recording and adding the particular episode to the list of bad habit episodes for the video recording in response to the recognition accuracy of the particular episode being low and the user confirming the particular episode.
2. The method of claim 1, wherein the machine learning includes an initial training phase that, prior to deployment, is used to obtain a general recognition capability for each item on the predetermined list of bad habits having a low recognition accuracy that has been confirmed by the user.
3. A method, according to claim 1, wherein the confirmation provided by the user is used to improve the recognition accuracy of the machine learning.
4. The method of claim 1, wherein the one or more capturing devices include a laptop with a camera and a microphone, a mobile device, autonomous cameras, add-on cameras, headsets, regular speakers, smart watches, wristbands, smart rings, and wearable sensors, smart eyewear, heads-up displays, headbands, and smart footwear.
5. The method of claim 1, wherein the data about the user includes at least one of: visual data, sound data, motion, proximity and chemical sensor data, heart rate, breathing rate, and blood pressure.
6. The method of claim 1, wherein the list of bad habits includes nail biting, yelling, cursing, making unacceptable gestures, digging one's nose, yawning, blowing one's nose, combing one's hair, slouching, and looking away from a screen.
7. The method of claim 6, wherein technologies used to detect bad habits include facial recognition, sound recognition, speech recognition, gesture recognition, and hand movement recognition.
8. The method of claim 7, wherein yawning is detected using a combination of the facial recognition technology, the sound recognition technology, and the gesture recognition technology.
9. The method of claim 7, wherein face touching is detected using the facial recognition technology and hand movement recognition technology.
10. The method of claim 7, wherein digging one's nose is detected using the facial recognition technology and hand movement recognition technology.
11. The method of claim 1, wherein at least some bad habit episodes from the list of bad habit episodes are presented to the user.
12. The method of claim 11, wherein the user deletes at least one of the bad habit episodes from the video recording.
13. The method of claim 12, wherein a portion of the video recording corresponding to the at least one of the bad habit episodes is re-recorded by the user.
14. The method of claim 1, wherein a list of identified types of bad behavior is presented to the user with examples and recommendations.
15. The method of claim 14, wherein the list of identified types of bad behavior is used for avoidance of bad behavior for future recordings.
16. The method of claim 14, wherein the list of identified types of bad behavior is presented to the user as part of learning program presented to the user.
17. The method of claim 16, wherein the learning program tracks user success in behavior improvement and repeats learning cycles as needed.
18. A method of preventing a user from touching a face of the user, comprising:
obtaining video frames of the user including the face of the user;
applying facial recognition technology to the video frames to detect locations of particular portions of the face;
detecting a position, shape, and trajectory of a moving hand of the user in the video frames;
predicting a final position and shape of the hand based on the position, shape, and trajectory of the hand and on the locations of the specific portions of the face; and
providing an alarm to the user in response to predicting that a final position of the hand will be touching the face of the user to prevent the user from touching the face of the user, wherein predicting a final position of the hand includes determining if the hand crosses an alert zone that is proximal to the face.
19. The method of claim 18, wherein the alarm varies according to a predicted final shape of the hand and according to predicting that a final position of the hand will be touching a specific one of the particular portions of the face.
20. The method of claim 19, wherein the predicted final shape of the hand is one of: an open palm and open fingers.
21. A method of preventing a user from touching a face of the user, comprising:
detecting a position, shape, and trajectory of a moving hand of the user based on one or more sensors from a wearable device of the user;
predicting a final position and shape of the hand based on the position, shape, and trajectory of the hand and on the locations of the specific portions of the face; and
providing an alarm to the user in response to predicting a final position of the hand will be touching the face of the user to prevent the user from touching the face of the user, wherein predicting the final position and shape of the hand uses machine learning that is adapted to the user during a training phase.
22. The method of claim 21, wherein the wearable device is a smart watch and at least one of the position, shape, and trajectory of the moving hand is determined using sensors of the smart watch.
23. The method of claim 21, wherein the wearable device is a smart ring and at least one of the position, shape, and trajectory of the moving hand is determined using a proximity sensor, an accelerometer or a gyroscope of the smart ring.
24. The method of claim 21, wherein the wearable device is smart glasses and at least one of the position, shape, and trajectory of the moving hand is determined using a proximity sensor of the smart glasses.
25. A non-transitory computer readable medium containing software that handles unacceptable behavior by a user making a video recording, the software comprising:
executable code that detects the unacceptable behavior by the user while the user is making the recording by applying machine learning to data about the user received from one or more capturing devices and by using a predetermined list of bad habits;
executable code that determines recognition accuracy for a particular episode of the unacceptable behavior;
executable code that marks the particular episode in the video recording and adds the particular episode to a list of bad habit episodes for the video recording in response to recognition accuracy of the particular episode being high;
executable code that prompts the user for confirmation of the particular episode in response to the recognition accuracy of the particular episode being low; and
executable code that marks the particular episode in the video recording and adds the particular episode to the list of bad habit episodes for the video recording in response to the recognition accuracy of the particular episode being low and the user confirming the particular episode.