HST a fast code for anomaly detection in time series

Posted on 16/07/2022, in English. Reading time: 9 mins

HOT SAX Time (HST): a fast algorithm for discord search.

HST is a fast algorithm that calculates anomalies (discords) in time series. It is a form of unsupervised machine learning.

Download here the HST code called: onlineHS147_clean.f90.

Download here the file that controls the parameters of the search, called: input.dat.

Download here a file with a ECG time series (with more than 500k points) used in the article: 300_signal1.txt (you should be able to find it at the Rare Rule Anomaly webpage too).

We first released the article explaining HST on the ArXive in 2021, recently it also appeared in Applied Intelligence journal. Please, if you use the code, cite the article as described at the bottom of the webpage.

Avogadro, P., Dominoni, M.A. A fast algorithm for complex discord searches in time series: HOT SAX Time. Appl Intell 52, 10060–10081 (2022). https://doi.org/10.1007/s10489-021-02897-z

What is a Discord? It is a specific kind of anomaly, from an intuitive point of view, a discord is a sequence whose nearest neighbor sequence has the highest distance.

Just imagine that there is a town and you want to find the anomalous house. A good idea is to find the most isolated house. That can be translated in the house whose nearest neighbor is far away. You can find a more detailed explanation of what is a discord in our paper (where there are also the references to the original article that defined Discords for time series and a fast algorithm to find them: HOT SAX.

Performance

HST can be up to 100 times faster than its “parent” algorithm HOT SAX. The computational time is approximately proportional to the number of discords searched and to the size of the time series. HST returns exact discords (it is not an approximate algorithm), but it follows a heuristic subject to randomization. For this reason the computational times change slightly for each run. You can expect variations of the runtime of about 10-20% from one run to the other. Most of the time is spent by calculating distances among sequences.

Disclaimer

WARNING: this code comes with no warranty, it might create problems and damages to your computer. Please use it at your own risk. The main problem that I had was due to the creation of large support files like those creating the approximate matrix profile. This version of the code has also other minor bugs.

Notice that the present version of the code does not build the file with all the approximate NNDs since this takes space and time. When tating into account the execution time, you should be aware of the fact that, just reading the time series might take time. For example it takes 2-3 s to read a 8M point time series. Keep this in mind when you compare the performance of the search with other codes. Fortran unformatted files are much faster to read and they take much less space (approx 1/3). However the present version of the code does not read unformatted Fortran files.

Compiling the Code

The present code is written in Fortran. Be aware that it contains a lot of subroutines no longer in use, the code should be cleaned better. It is a rather old version and I know it contains a few bugs, but they should be minor. You need a Fortran compiler (for example, I usually use Gfortran, since it creates quick codes and it is free). You can find a list of links for Fortran compilers at the Fortran webpage (including the Gfortran).

Here I give you the instruction on how to compile the code using as a reference the gfortan compiler. I will suppose that you are using a Linux distribution to compile and run the code. You should go to the folder where there is the code (It should be onlineHS147_clean.f90, but I might insert other versions of the code in the future), open a terminal and write:

gfortran -O3 onlineHS147_clean.f90 -o onlineHS147_clean.x

The result of the compilation is a bunch of support files and the new executable: online147_clean.x

The input file

At this point you need to configure the discord search with the input.dat file:

      eXl,          points for each letter
       lung_parole,  number of letters for Symbolic sequence
       num_simboli,  size of the alphabet  
       Nd,           number of discords you want for each present_queue
-1        N,            number of points for each present queue (-1 takes the whole time series)
       nmax,         number of present_queues to be analyzed
    number of elements in coda_discord
   Tmax          distance (points) before obsolescence
       nnei          fossil parameter do not change 
     nbin          numero di bins for the statistics regarding the symbolic sequences
T         Znorm         z-normalizzazione T= True, F= False
altro     algorithm name: altro (HST), brute (brute force), ...  

HST allows you to search for the discord of length m within the time series (where m is the number of points of the sequence) You do not enter directly m in this input file, but rather you need to adjust the eXl and lung_parole parameters so that their product is equal to the m you need. This is due to the fact that HST is derived from HOT SAX, so you should enter the Symbolic Aggregate Approximation parameters, that turn each sequence into a symbolic seuquence.

m = eXl x lung_parole

This is a limitation: you cannot search for sequences that have a length that is a prime number since it has to be the product of two integers (>2). For example, you cannot find discords of lenght 43 with this implementation of the code; while if you need m =300 you can pick:

eXl = 75
lung_parole = 4

It is rather safe to keep num_simboli = 3, 4 or 5.

Nd is the number of discords that you need, you could enter 1, 2, 3,…. As long as m x Nd « N the calculation should complete succesfully.

nmax is used to simulate an online analysis on a batch file. It essentially divides the batch in nmax pieces.

The number of elements in coda discord is the size of the batches in the case of online searches.

Tmax is the obsolescence parameter (also used for online analysis)

nnei is a fossil of old implementations, it should do nothing, in any case it is better to not change it.

nbin is a parameter that is used for the statistics on the symbolic sequences, it is better to not change it.

Znorm tells the computer to calculate the Z-normalized Euclidean distance (if T ) or the regular Euclidean distance (when its value is F)

The last parameter to be changed is the algorithm name. Curiosly when you want to run the HST algorithm, you should write ‘altro’ (it means other in Italian, HST was a derivation of HOT SAX, and I never changed this name)

For example, the input.dat file shown above works fine when searching for 6 discords of length 125 in a time series of unknown length using the HST algorithm, considering Z-norm distances:

Run a discord search.

The code is expected to read and process files where the points of a time series one on top of the other in a txt like file. A file that can be processed by HST could look like this:

-0.215
-0.24
-0.24
-0.225
-0.225
-0.2
-0.23
-0.28
-0.365
-0.41
-0.455
-0.5

...

In order to run the search you just need the executable file (and its support files created at compilation time) to be in the same folder where there is the input.dat file. Let’s suppose that the time series is in a file called 300_signal1.txt (in the same folder), you open a terminal in it, and you type:

time ./onlineHS147_clean.x 300_signal1.txt

By using the command time, you will get information regarding the execution time.

HST will first display a some information regarding the search and it will start processing and writing information files in the same folder (and in two sub-folders it creates).

At the end of the execution it will display the first discord and its distance from the nearest neighbor. You can find the remaining discords in the file that begins with anomalieTrovate_…. The file name contains some information regarding the run, inside there is the position of the anomaly and the distance with the nearest neighbor.

Take into account that in computer science the first sequence is usually associated with position 0, while for me the first sequence is at position 1, so there might be 1 point difference between my results and those of other discord codes.

Here is an example of a run (sorry the comments displayed by HST are is in Italian):

 lunghezza sequenze da analizzare  (lung_parole x eXl)  =         125
 numero di punti per lettera                     (eXl)  =          25
 num. lettere della sequenza simbolica   (lung_parole)  =           5
 grandezza alfabeto PAA                  (num_simboli)  =           4
 Numero di possibili cluster: (num_simboli^lung_parole) =        1024
 Lunghezza della present queue                     (N)  =      536976
 Numero di discord cercati per coda               (Nd)  =           6
 Numero di passi prima dell'obsolescenza        (Tmax)  =       60000
 Numero di cluster vicini dove ricercare        (nnei)  =           0
 Numero bin della statistica seq simb           (nbin)  =         100
 Usa Z-normalization?                           (Znorm) = T
 Usa seme random?                       (useRandomSeed) = T
 Usa breakpoint Gaussiani?            (gaussBreakpoints)= F
 Metodo risparmia memoria?            (risparmiaMemoria)= T
 Multi SAX?                                      (musax)= F

  algoritmo:        altro - (SOD non hanno senso)

  eseguibile:        ./onlineHS147_clean.x

  file analizzato:   300_signal1.txt

 Pre riscaldamento   536726.00000000000     
 In pulisci          563628.00000000000     
  Percentuale Completamento                                1   534758.00000000000     
           3   536785.00000000000     
97%( 1)            2   537603.00000000000     
           6   535760.00000000000     
           8   536635.00000000000     
          10   535130.00000000000     
          99   728597.00000000000     
97%( 2)            9   536611.00000000000     
          17   539073.00000000000     
          21   539236.00000000000     
          22   536782.00000000000     
          60   548933.00000000000     
97%( 3)           19   536589.00000000000     
          28   538722.00000000000     
          36   535607.00000000000     
          44   638288.00000000000     
          49   537219.00000000000     
         126   715451.00000000000     
97%( 4)           61   537133.00000000000     
          65   535747.00000000000     
          67   538952.00000000000     
          68   535054.00000000000     
          75   538918.00000000000     
97%( 5)           77   536836.00000000000     
          86   537931.00000000000     
          91   539489.00000000000     
          97   536377.00000000000     
97%( 6) 
  Entropia di Shannon    5.0499323517087751     
 
  Chiamate a distanza      16095022.0
  primo Discord        =       441745 ,  nnd=   12.380154963862225     

real	0m15.730s
user	0m14.833s
sys	0m0.410s