
Last Updated: April 18, 2024

Details for Patent: 9,805,270


Title: Video segmentation techniques
Abstract: A video segmentation system can be utilized to automate segmentation of digital video content. Features corresponding to visual, audio, and/or textual content of the video can be extracted from frames of the video. The extracted features of adjacent frames are compared according to a similarity measure to determine boundaries of a first set of shots or video segments distinguished by abrupt transitions. The first set of shots is analyzed according to certain heuristics to recognize a second set of shots distinguished by gradual transitions. Key frames can be extracted from the first and second set of shots, and the key frames can be used by the video segmentation system to group the first and second set of shots by scene. Additional processing can be performed to associate metadata, such as names of actors or titles of songs, with the detected scenes.
Inventor(s): Carlson; Adam (Seattle, WA), Gray; Douglas Ryan (Redwood City, CA), Kulkarni; Ashutosh Vishwas (Bellevue, WA), Taylor; Colin Jon (Orinda, CA)
Assignee: Amazon Technologies, Inc. (Reno, NV)
Filing Date: Sep 02, 2016
Application Number: 15/255,978
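
The abstract and claims 1 and 4 below describe the first stage of the pipeline: a visual feature is computed for each frame (a whole-frame histogram together with histograms of sub-regions), and adjacent frames are compared under a similarity measure to locate abrupt shot boundaries. The sketch below is a minimal illustration of that idea in Python with NumPy; the bin count, grid size, and similarity threshold are illustrative assumptions rather than values taken from the patent, and a single grid of regions stands in for the two sets of regional histograms recited in claim 4.

    import numpy as np

    def frame_features(frame, bins=16, grid=2):
        # Whole-frame color histogram plus histograms of a grid of sub-regions,
        # so both global color and rough spatial layout are captured.
        def hist(region):
            h = np.concatenate([
                np.histogram(region[..., c], bins=bins, range=(0, 256))[0]
                for c in range(region.shape[-1])
            ]).astype(float)
            return h / (h.sum() + 1e-9)

        parts = [hist(frame)]
        for row in np.array_split(frame, grid, axis=0):
            for cell in np.array_split(row, grid, axis=1):
                parts.append(hist(cell))
        return np.concatenate(parts)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def detect_hard_cuts(frames, threshold=0.9):
        # Mark a shot boundary wherever adjacent-frame similarity drops below the threshold.
        feats = [frame_features(f) for f in frames]
        boundaries = [0]
        for i in range(1, len(feats)):
            if cosine_similarity(feats[i - 1], feats[i]) < threshold:
                boundaries.append(i)
        boundaries.append(len(frames))
        # Return (start, end) index pairs, one per detected shot.
        return list(zip(boundaries[:-1], boundaries[1:]))

Frames would typically be decoded into arrays with a library such as OpenCV. Gradual-transition detection, key-frame selection, scene grouping, and shot classification are sketched after the claims below.
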
Claims:

1. A computer-implemented method, comprising: determining a feature for a frame of a plurality of frames of a video; analyzing a similarity between the feature and at least one feature associated with adjacent frames to the frame to determine a first shot of the video; determining that the first shot meets a time threshold; determining that a similarity metric between a first frame of the first shot and a second frame of the first shot meets a dissimilarity threshold; determining that a similarity matrix of at least a subset of frames of the first shot corresponds to a dissolve pattern, the subset of frames corresponding to at least one second shot of the video; generating a graph of the video, the graph comprising nodes corresponding to the first shot and the at least one second shot; and determining a grouping of the first shot and the at least one second shot by performing one or more cuts of the graph.

2. The computer-implemented method of claim 1, wherein analyzing similarity between the respective features for adjacent frames further includes: determining respective cosine similarity between the respective features for the adjacent frames; and comparing the respective cosine similarity between the respective features for the adjacent frames to a similarity threshold.

3. The computer-implemented method of claim 1, wherein determining that the similarity matrix of at least the subset of frames of the first shot corresponds to the dissolve pattern further includes: generating the dissolve pattern; sliding the dissolve pattern along a diagonal of the similarity matrix; and matching the dissolve pattern to at least one portion of the diagonal.

4. The computer-implemented method of claim 1, wherein determining the respective features for each frame further includes: determining a first histogram for the frame; determining a first plurality of histograms for first portions of the frame; and determining a second plurality of histograms for second portions of the frame.

5. The computer-implemented method of claim 1, wherein determining the grouping of the first shot and the at least one second shot includes: obtaining one or more respective key frames for the first shot and the at least one second shot, wherein the nodes of the graph correspond to the respective key frames.

6. The computer-implemented method of claim 5, wherein edges of the graph correspond to a respective cost between the nodes, and wherein the respective cost is based on a function of time and visual similarity.

7. The computer-implemented method of claim 1, further comprising: obtaining one of an audio feature corresponding to the video or a text feature corresponding to the video, wherein the grouping is further based at least in part on one of the audio feature or the text feature.

8. The computer-implemented method of claim 1, further comprising: detecting a face in the grouping; determining an identity of the face; and associating the identity with the grouping.

9. The computer-implemented method of claim 1, further comprising: detecting one of textual data corresponding to the grouping or music in the grouping, the music associated with a title; and associating one of the textual data or the title with the grouping.

10. The computer-implemented method of claim 1, further comprising: analyzing visual content of the first shot; and classifying the first shot as one of a dissolve shot, a blank shot, a card credit, a rolling credit, an action shot, or a static shot.

11. A non-transitory computer-readable storage medium comprising instructions that, upon being executed by a processor of a computing device, cause the computing device to: determine a feature for a frame of a plurality of frames of a video; analyze a similarity between the feature and at least one feature associated with adjacent frames to the frame to determine a first shot of the video; determine that the first shot meets a time threshold; determine that a similarity metric between a first frame of the first shot and a second frame of the first shot meets a dissimilarity threshold; determine that a similarity matrix of at least a subset of frames of the first shot corresponds to a dissolve pattern, the subset of frames corresponding to at least one second shot of the video; generate a graph of the video, the graph comprising nodes corresponding to the first shot and the at least one second shot; and determine a grouping of the first shot and the at least one second shot by performing one or more cuts of the graph.

12. The non-transitory computer-readable storage medium of claim 11, wherein the instructions, upon being executed, further cause the computing device to: associate metadata with the grouping; and enable a user to navigate to the grouping based on the metadata.

13. The non-transitory computer-readable storage medium of claim 12, wherein the metadata corresponds to at least one of an identity of an actor appearing in the at least one grouping, title of music playing in the at least one grouping, a representation of an object in the at least one grouping, a location corresponding to the at least one grouping, or textual data corresponding to the at least one grouping.

14. The non-transitory computer-readable storage medium of claim 11, wherein the instructions, upon being executed to determine that the similarity matrix of at least the subset of frames of the first shot corresponds to the dissolve pattern, further cause the computing device to: generate the dissolve pattern; slide the dissolve pattern along a diagonal of the similarity matrix; and match the dissolve pattern to at least one portion of the diagonal.

15. The non-transitory computer-readable storage medium of claim 11, wherein the instructions, upon being executed to analyze similarity between the respective features for adjacent frames, further cause the computing device to: determine respective cosine similarity between the respective features for the adjacent frames; and compare the respective cosine similarity between the respective features for the adjacent frames to a similarity threshold.

16. A computing device, comprising: a processor; memory including instructions that, upon being executed by the processor, cause the computing device to: determine a feature for a frame of a plurality of frames of a video; analyze a similarity between the feature and at least one feature associated with adjacent frames to the frame to determine a first shot of the video; determine that the first shot meets a time threshold; determine that a similarity metric between a first frame of the first shot and a second frame of the first shot meets a dissimilarity threshold; determine that a similarity matrix of at least a subset of frames of the first shot corresponds to a dissolve pattern, the subset of frames corresponding to at least one second shot of the video; generate a graph of the video, the graph comprising nodes corresponding to the first shot and the at least one second shot; and determine a grouping of the first shot and the at least one second shot by performing one or more cuts of the graph.

17. The computing device of claim 16, wherein the instructions, upon being executed, include causing the computing device to: determine respective cosine similarity between the respective features for the adjacent frames; and compare the respective cosine similarity between the respective features for the adjacent frames to a similarity threshold.

18. The computing device of claim 16, wherein the instructions, upon being executed to determine that a similarity matrix of at least the subset of frames of the first shot corresponds to the dissolve pattern, include causing the computing device to: generate the dissolve pattern; slide the dissolve pattern along a diagonal of the similarity matrix; and match the dissolve pattern to at least one portion of the diagonal.

19. The computing device of claim 16, wherein the instructions, upon being executed to determine the respective features for each frame, include causing the computing device to: determine a first histogram for the frame; determine a first plurality of histograms for first portions of the frame; and determine a second plurality of histograms for second portions of the frame.

20. The computing device of claim 16, wherein the instructions, upon being executed to determine the grouping of the first shot and the at least one second shot, include causing the computing device to: obtain one or more respective key frames for the first shot and the at least one second shot, wherein the nodes of the graph correspond to the respective key frames, wherein edges of the graph correspond to a respective cost between the nodes, and wherein the respective cost is based on a function of time and visual similarity.
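
Claims 3, 14, and 18 detect gradual transitions by generating a dissolve pattern and sliding it along the diagonal of a similarity matrix built over the frames of a shot. The sketch below is a simplified, hypothetical version: it assumes similarity decays linearly with temporal distance inside the transition window, which is an illustrative stand-in for whatever pattern the patented method actually generates, and the window length and error threshold are arbitrary.

    import numpy as np

    def self_similarity_matrix(feats):
        # Pairwise cosine similarities between the feature vectors of a shot's frames.
        f = np.asarray(feats, dtype=float)
        f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-9)
        return f @ f.T

    def dissolve_template(length):
        # Idealized pattern for a linear dissolve: similarity falls off with
        # temporal distance inside the transition window.
        idx = np.arange(length)
        return 1.0 - np.abs(idx[:, None] - idx[None, :]) / max(length - 1, 1)

    def find_dissolves(feats, length=12, max_error=0.05):
        # Slide the template along the matrix diagonal and report window starts
        # where the observed pattern closely matches it.
        sim = self_similarity_matrix(feats)
        template = dissolve_template(length)
        matches = []
        for start in range(sim.shape[0] - length + 1):
            window = sim[start:start + length, start:start + length]
            if np.mean((window - template) ** 2) < max_error:
                matches.append(start)
        return matches

A matched window would then be treated as a boundary splitting the enclosing shot, yielding the "at least one second shot" recited in claim 1.
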
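
Claims 5, 6, and 20 build a graph whose nodes correspond to key frames of the detected shots and whose edge costs combine time and visual similarity, and claim 1 groups shots by performing cuts of that graph. The sketch below is a heavily simplified, hypothetical take: the key frame is just the middle frame of each shot, features are assumed to be vectors such as the histograms in the earlier sketch, and the "cut" is reduced to severing weak links between temporally adjacent shots rather than a full graph-cut optimization.

    import numpy as np

    def shot_key_frames(frames, shots):
        # Pick the middle frame of each (start, end) shot as its key frame.
        return [frames[(start + end) // 2] for start, end in shots]

    def build_shot_graph(key_feats, shot_times, tau=30.0):
        # Edge weights combine visual similarity of key frames with temporal
        # proximity, mirroring the time-and-similarity cost of claims 6 and 20.
        k = np.asarray(key_feats, dtype=float)
        k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-9)
        visual = k @ k.T
        t = np.asarray(shot_times, dtype=float)
        temporal = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
        return visual * temporal

    def group_shots(weights, cut_threshold=0.5):
        # Start a new scene wherever the link between consecutive shots is weak,
        # a simplified stand-in for the graph cuts recited in claim 1.
        scenes, current = [], [0]
        for i in range(1, weights.shape[0]):
            if weights[i - 1, i] < cut_threshold:
                scenes.append(current)
                current = []
            current.append(i)
        scenes.append(current)
        return scenes  # each scene is a list of shot indices

Here shot_times would be, for example, the start timestamp of each shot in seconds, and tau controls how quickly temporal affinity decays; both are assumptions for illustration.
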
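
Claim 10 classifies each shot as a dissolve shot, blank shot, card credit, rolling credit, action shot, or static shot based on its visual content. The sketch below shows rough, hypothetical heuristics for three of those categories using simple frame statistics; credit and dissolve detection would require additional text and transition analysis, and all thresholds are arbitrary.

    import numpy as np

    def classify_shot(frames):
        # Rough heuristics over a shot's frames: darkness suggests a blank shot,
        # and the average frame-to-frame change separates static from action shots.
        stack = np.stack([f.astype(float) for f in frames])
        if stack.mean() < 10:
            return "blank shot"
        motion = np.abs(np.diff(stack, axis=0)).mean() if len(frames) > 1 else 0.0
        if motion < 2:
            return "static shot"
        if motion > 20:
            return "action shot"
        return "unclassified"
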

