Unsupervised method for video action segmentation through spatio-temporal and positional-encoded embeddings