Efficient generalized boundary detection

DWPI Title: Method for detecting a boundary within an arbitrary dataset, e.g. computer network traffic, using compression-based analytics and a computer, involving identifying the boundary within the dataset at the potential border corresponding to an anomalous fixed sliding information distance (SLID) score
Abstract: Fast, efficient, and robust compression-based methods for detecting boundaries in arbitrary datasets, including sequences (1D datasets), are desired. The methods, each employing three simple algorithms, approximate the information distance between two adjacent sliding windows within a dataset. One algorithm calculates an initial ordered list of subsequences, while a second updates that list by dropping its first entry and appending a new last entry rather than calculating a completely new ordered list at each iteration. Large values of the distance metric indicate boundary locations. A smoothed z-score or a wavelet-based algorithm may then be used to locate peaks in the distance metric, thereby identifying boundary locations. An adaptive version of the method employs a collection of window sizes and corresponding weighting functions, making it more amenable to real datasets with unknown, complex, and changing structures.
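The peak-location step mentioned in the abstract can be illustrated with a minimal smoothed z-score sketch. This is not the patented implementation: the function name `smoothed_zscore_peaks` and the parameters `lag`, `threshold`, and `influence` are assumptions borrowed from the commonly used form of the smoothed z-score technique, and only upward deviations are flagged because large distance values are the ones of interest here.

```python
from statistics import mean, pstdev


def smoothed_zscore_peaks(values, lag=5, threshold=3.0, influence=0.5):
    """Return indices whose value exceeds the trailing mean by more than
    `threshold` trailing standard deviations (hypothetical helper).

    lag       -- number of trailing (filtered) samples used for mean/std
    threshold -- number of standard deviations that counts as a peak
    influence -- weight (0..1) a flagged sample has on later statistics
    """
    if len(values) <= lag:
        return []
    filtered = list(values[:lag])  # the first `lag` samples seed the baseline
    peaks = []
    for i in range(lag, len(values)):
        window = filtered[i - lag:i]
        mu = mean(window)
        sigma = pstdev(window)
        if values[i] - mu > threshold * sigma:
            peaks.append(i)
            # Damp the flagged sample so one peak does not inflate the baseline.
            filtered.append(influence * values[i] + (1 - influence) * filtered[i - 1])
        else:
            filtered.append(values[i])
    return peaks


if __name__ == "__main__":
    # Illustrative distance scores with one obvious spike at index 8.
    scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.10, 0.11, 0.90, 0.11, 0.10, 0.09]
    print(smoothed_zscore_peaks(scores))
```

The `influence` parameter is the main design choice: a value near 0 keeps the baseline almost unaffected by detected peaks, while a value of 1 lets each peak fully enter the running statistics.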
Use: Method for detecting a boundary within an arbitrary dataset, such as computer network traffic, text, audio signals, or another sequence (1D dataset), using compression-based analytics and a computer.
Advantage: The method provides a fast, accurate, and robust approximation to the normalized information distance (NID).
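For context, the NID is defined in terms of Kolmogorov complexity and is therefore not computable in general. A widely used practical stand-in is the normalized compression distance (NCD), which replaces Kolmogorov complexity with the output size of a real compressor. The sketch below scores adjacent sliding windows with an NCD built on `zlib`. It is a swapped-in illustration of the general idea, not the SLID score described in this record; the names `ncd` and `sliding_scores` are hypothetical.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, used as a crude complexity estimate."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def sliding_scores(data: bytes, window: int, step: int = 1):
    """Score every potential border by the NCD of the windows on either side.

    Returns a list of (border, score) pairs, where `border` is the index of
    the first byte of the right-hand window.
    """
    scores = []
    for border in range(window, len(data) - window + 1, step):
        left = data[border - window:border]
        right = data[border:border + window]
        scores.append((border, ncd(left, right)))
    return scores


if __name__ == "__main__":
    # Two regimes with different repeating structure; the true boundary is at 400.
    data = b"ab" * 200 + b"0123456789" * 40
    scores = sliding_scores(data, window=100)
    best_border, best_score = max(scores, key=lambda s: s[1])
    print(best_border, round(best_score, 3))
```

Recompressing both windows at every border is the slow, naive approach. It is shown only to make the distance concrete, and it is the kind of repeated work the incremental list update described in this record is meant to avoid.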
Novelty: The method involves initializing a set of fixed sliding information distance (SLID) scores to zero. A first sub-dataset window and a second sub-dataset window are generated, where the second sub-dataset window is placed adjacent to the first sub-dataset window at a potential border, and each window is a respective sub-dataset of the dataset. If the potential border is the initial potential border, an initial ordered list of subsequences is computed for each token in the first and second sub-dataset windows; otherwise, a subsequent ordered list of subsequences is computed. The corresponding fixed SLID score at the potential border is computed from the initial or subsequent ordered list of subsequences. The potential border is then incremented, and these steps are repeated for all potential borders. A boundary within the dataset is identified at the potential border corresponding to an anomalous fixed SLID score.
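The record does not spell out the data structure behind the per-token ordered lists, so the following is only one possible reading, offered as a sketch: each token maps to a position-ordered list of its occurrences in the window, and sliding the window by one drops the first entry of one list and appends a last entry to another. The names `build_index` and `slide_index`, and the use of start positions to stand for subsequences, are assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_index(data, start, window):
    """Initial pass: map each token to the ordered positions where it occurs
    in data[start : start + window]."""
    index = defaultdict(deque)
    for pos in range(start, start + window):
        index[data[pos]].append(pos)
    return index


def slide_index(index, data, start, window):
    """Advance the window from [start, start + window) by one position.

    The token leaving the window is always the oldest entry of its list, so
    it is dropped from the front; the token entering is the newest, so it is
    appended at the back. No list is rebuilt from scratch.
    """
    leaving = data[start]
    index[leaving].popleft()
    if not index[leaving]:
        del index[leaving]  # keep the index free of empty lists
    entering_pos = start + window
    index[data[entering_pos]].append(entering_pos)
    return index


if __name__ == "__main__":
    data = "abracadabra"
    window = 4
    index = build_index(data, 0, window)
    for start in range(len(data) - window):
        index = slide_index(index, data, start, window)
    print({token: list(positions) for token, positions in index.items()})
```

Each slide costs constant time on average under this reading, whereas rebuilding the index costs time proportional to the window size.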
Filed: 3/31/2021
Application Number: US17219217A
Tech ID: SD 15113.1
This invention was made with Government support under Contract No. DE-NA0003525 awarded by the United States Department of Energy/National Nuclear Security Administration. The Government has certain rights in the invention.
Data from Derwent World Patents Index, provided by Clarivate