zhaokg / Rbeast

Bayesian Change-Point Detection and Time Series Decomposition
208 stars 36 forks source link

Writing BEAST output to disk not working #2

Closed IHassall closed 2 years ago

IHassall commented 4 years ago

I tried to write the BEAST output to disk using: option$outputToDisk=1 option$outputFolder='C:/folder' But when the optional parameters were printed on starting BEAST the output folder had changed to "outputFolder=Ñ0zl" "Writing output..." was printed on the console but no files were written in the specified output folder. I searched for a temporary folder named Ñ0zl, but found nowhere that BEAST was writing the output to. My friend and I tried using different folders, but instead both got "outputFolder=aKzl" on different data. I tried to look at the scripts underlying beast, but couldn't figure out where the error was.

It would also be good to know where the outputs "marg_lik" and "sig2" have come from and what they mean, as this isn't outlined in the Rbeast package guidance.

zhaokg commented 2 years ago

Hi IHassall, Sorry for not get a chance to see your issue report earlier and for my belated reply. Ia am sure that my reply here is not relevant to your use case now. Regardless, thanks a lot for letting me know the issue. Apologies for all the confusion.

In the original version, I did implement the input of data and output of results directly to files on a harddisk, but that seemed to be a very bad idea because the formats of the inputs/outputs can be meaningless to users. So, the outfolder parameter is a residual from such attempts. So, please ignore it and the version released now doesn't support it.

Instead, it should suffice to directly dump the output via R's or Matlab's save command. The saved data can be re-loaded via the load command. Not sure about your use cases. I assume you wanted to handle many time series (e.g. , multiple time seires of the same length or stacked time-series images). In the new version, I implemented an interface to handle multiple or image time-series data. Below is one example:

 library(Rbeast)
 data(imagestack)
 dim(imagestack$ndvi)               # Dim: 12 x 9 X 1066 (row x col x time)
 imagestack$datestr                   # A character vector of 1066 date strings

 metadata                  = list()        
 metadata$isRegularOrdered = FALSE  # 'imagestack$ndvi' is an IRREGULAR input
 metadata$whichDimIsTime   = 3         # Which dim of the input refer to time for 3D inputs?
                                                              # 1066 is the ts length, so dim is set to '3' here.
 metadata$time$datestr     =  imagestack$datestr 
 metadata$time$strfmt      =  'LT05_018032_20080311.yyyy-mm-dd'
 metadata$deltaTime        =  1/12  # Aggregate the irregular ts at a monthly interval:1/12 Yr
 metadata$period           =  1.0       # The period is 1 year: deltaTime*freq=1/12*12=1.0

 extra = list()
 extra$dumpInputData        = TRUE  # Get a copy of aggregated input ts 
 extra$numThreadsPerCPU     = 2     # Each cpu core will be assigned 2 threads
 extra$numParThreads        = 0        # If 0, total_num_threads=numThreadsPerCPU*num_of_cpu_core
                                                         # if >0, used to specify the total number of threads

 # Default values for missing parameters 
 o=beast123(imagestack$ndvi, metadata=metadata,extra=extra) 
 print(o,c(5,3))    # print the result for the pixel at Row 5

Here is another example by first loading the individual TIFF images into an 3D array via the raster package and then applying BEAST (like the example above, the images are a tiny area for illustration purposes).

  zipfile = system.file("extdata/ndvi.zip", package="Rbeast") # the path of ndvi.zip
  tmpfld  = file.path(tempdir(), "RbeastTmpDir" )                   # get a temp folder
  dir.create(tmpfld)                                                                 # create a tmp folder
  unzip(zipfile, exdir= normalizePath(tmpfld))                       # unzip the images to the tmp folder

  filelist=list.files(path=tmpfld, pattern='*.tif', full.names=TRUE) # the tiff filenames
  ndvi3d  = stack(filelist)       # create a rasterstack object 
  inMemory(ndvi3d)             # image data not read yet
  ndvi3d  = readAll(ndvi3d)  # To use beast, make sure all the data is read into memory
                                             # 'raster' is slow with the reading,even for our small images
  ndvi3d  = brick(ndvi3d)     # Convert the stack to a RasterBrick object
  inMemory(ndvi3d)             # Now, all the image data is in memory
  dims    = dim(ndvi3d) # WARNING: the 'raster' package reports a dim of 12 x 9 x 437 
                                     #corresponding to row x col x band/time. But 'raster' seems to 
                                     # use a row-major storage mode. So internally, the RasterBrick image
                                    # data is 9 x 12 x 437 in term of R's column-major storage mode:
                                    #  beast also use the col-major mode.
  Y           = values(ndvi3d)
  dim(Y)   = c(9, 12, 437)       # Assign the correct column-major dim
  datestr  = names(ndvi3d)   # the time strings

  # Now the image data is read into Y (a 9x12x437 array); then set up some BEAST parameters

  metadata                  = list()        
  metadata$isRegularOrdered = FALSE    # IRREGULAR input
  metadata$whichDimIsTime   = 3            # 437 is the ts length, so set it to '3' here.
  metadata$time$datestr     = datestr       # date info is contained in the file names
  metadata$time$strfmt      = 'LT05_018032_20080311.yyyy-mm-dd'
  metadata$deltaTime        = 1/12            # Aggregate Y into a monthly ts: 1/12 year
  metadata$period           = 1.0                # The period is 1 year: deltaTime*freq=1/12*12=1.0

  extra = list()
  extra$numThreadsPerCPU    = 3   # Each cpu core will spawn 3 threads
  extra$numParThreads       = 30     # A total of 30 threads used, each processing individual
                                                        #  time series. For a computer with 100 cores, only the 
                                                        # first 30/3=10 cores will be used; if the computer has 5 
                                                        # cores only, then each core will spawn 30/5=6 threads 
                                                          # instead of numThreadsPerCPU=3. 

  o=beast123(Y, metadata=metadata,extra=extra) # Default values used for missing parameters 

  #WARNING: bcz the input ts is from a RasterBrick object that has a row-major storage mode, 
  #the beast output will be of size 9 x 12 for each band/time. That is, the row and col are
  #reversed, compared to the raster package. This is a glitch only with the raster package.
  #The beast output dim is consistent with the dim of the input Y.

  image(o$trend$ncp)                       # number of changepoints over space
  grid::grid.raster( o$trend$ncpPr[,,1:3]) # an pseduo-RGB image composed from the prob of 
                                           # having 0, 1, and 2 trend changepoints.
  plot(o,c(4,5))
  unlink(tmpfld, recursive = TRUE)         # Delete the tmp folder and extracted tif files 
zhaokg commented 2 years ago

Thanks for the comments on the doc. Yes, the doc for the package sucks--something I haven't put too many efforts into. As to marg_lik and sig2, marg_lik is the estimated marginal model likelihood (or model evidence). The larger it is, the better the fitted model. The use of marg_lik can allow anweriqng questions like which model hypothesis is bttter (e.g., one model with a maximum of 4 changepioints vs another model with a maximum of 2 changepoints). In this post, I gave an example to illustrate its usage to compare alternative models: https://stackoverflow.com/questions/51483673/model-comparison-for-breakpoint-time-series-model-in-r-strucchange.

For sig2, it is the error variance, just as in typical linear regression. Here is an example:

Y=rnorm(50)*3;
# no changepoint at all; the min and max number of changepoints are set to both zero
o=beast(Y, season='none', tcp.minmax = c(0,0)) 

o$sig2                       
# gives  8.992739; the true value is 3*3=9; your value may differ due to the use of a different seed to simulate Y

Y=rnorm(50)*10;
o=beast(Y, season='none', tcp.minmax = c(0,0))
o$sig2 #gives  112.4101 ; the true value is 10*10=100