yangyanli / PointCNN

PointCNN: Convolution On X-Transformed Points (NeurIPS 2018)
https://arxiv.org/abs/1801.07791

Core dump when running train_val_seg.py #204

Closed tiger-bug closed 4 years ago

tiger-bug commented 4 years ago

Good afternoon,

First of all, thank you for your code. This is very good work.

I am trying to run this on some of my own data. However, there seems to be an issue at the `net` line: nothing was being written to my log file, so I started executing the script line by line to see where it stopped.

Note: I'm in troubleshooting mode, so the code below is incomplete.

Code 1 (with the `net` line commented out)


"""Training and Validation On Segmentation Task."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys
import math
import random
import shutil
import argparse
import importlib
sys.path.append('/opt/pointcnn/util_files/')
import data_utils
import numpy as np
import pointfly as pf
import tensorflow as tf
from datetime import datetime
from pointcnn_seg import Net
# from pointcnn import PointCNN
#
# class Net(PointCNN):
#     def __init__(self, points, features, is_training, setting):
#         PointCNN.__init__(self, points, features, is_training, setting)
#         self.logits = pf.dense(self.fc_layers[-1], setting.num_class, 'logits',
#                                is_training, with_bn=False, activation=None)
def main():
    # os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    parser = argparse.ArgumentParser()
    parser.add_argument('--filelist', '-t', help='Path to training set ground truth (.txt)', required=True)
    parser.add_argument('--filelist_val', '-v', help='Path to validation set ground truth (.txt)', required=True)
    parser.add_argument('--load_ckpt', '-l', help='Path to a check point file for load')
    parser.add_argument('--save_folder', '-s', help='Path to folder for saving check points and summary', required=True)
    parser.add_argument('--model', '-m', help='Model to use', required=True)
    parser.add_argument('--setting', '-x', help='Setting to use', required=True)
    parser.add_argument('--epochs', help='Number of training epochs (default defined in setting)', type=int)
    parser.add_argument('--batch_size', help='Batch size (default defined in setting)', type=int)
    parser.add_argument('--log', help='Log to FILE in save folder; use - for stdout (default is log.txt)', metavar='FILE', default='log.txt')
    parser.add_argument('--no_timestamp_folder', help="Don't save to timestamp folder", action='store_true')
    parser.add_argument('--no_code_backup', help="Don't backup code", action='store_true')
    args = parser.parse_args()
    # print(args)
    if not args.no_timestamp_folder:
        time_string = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        root_folder = os.path.join(args.save_folder, '%s_%s_%s_%d' % (args.model, args.setting, time_string, os.getpid()))
    else:
        root_folder = args.save_folder
    if not os.path.exists(root_folder):
        os.makedirs(root_folder)

    if args.log != '-':
        sys.stdout = open(os.path.join(root_folder, args.log), 'w')

    print('PID:', os.getpid())

    print(args)

    model = importlib.import_module(args.model)
    setting_path = os.path.join(os.path.dirname(__file__), args.model)
    sys.path.append(setting_path)
    setting = importlib.import_module(args.setting)

    num_epochs = args.epochs or setting.num_epochs
    batch_size = args.batch_size or setting.batch_size
    sample_num = setting.sample_num
    step_val = setting.step_val
    label_weights_list = setting.label_weights
    rotation_range = setting.rotation_range
    rotation_range_val = setting.rotation_range_val
    scaling_range = setting.scaling_range
    scaling_range_val = setting.scaling_range_val
    jitter = setting.jitter
    jitter_val = setting.jitter_val

    # Prepare inputs
    print('{}-Preparing datasets...'.format(datetime.now()))
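    # The training filelist can either name .h5 files directly or name other
    # filelists; in the latter case the script walks through them one at a time.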
    is_list_of_h5_list = not data_utils.is_h5_list(args.filelist)
    if is_list_of_h5_list:
        seg_list = data_utils.load_seg_list(args.filelist)
        seg_list_idx = 0
        filelist_train = seg_list[seg_list_idx]
        seg_list_idx = seg_list_idx + 1
    else:
        filelist_train = args.filelist
    data_train, _, data_num_train, label_train, _ = data_utils.load_seg(filelist_train)
    data_val, _, data_num_val, label_val, _ = data_utils.load_seg(args.filelist_val)

    # shuffle
    data_train, data_num_train, label_train = \
        data_utils.grouped_shuffle([data_train, data_num_train, label_train])

    num_train = data_train.shape[0]
    point_num = data_train.shape[1]
    num_val = data_val.shape[0]
    print('{}-{:d}/{:d} training/validation samples.'.format(datetime.now(), num_train, num_val))
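    # batch_num: total training steps across all epochs (ceiling division);
    # batch_num_val: batches needed to cover the validation set once.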
    batch_num = (num_train * num_epochs + batch_size - 1) // batch_size
    print('{}-{:d} training batches.'.format(datetime.now(), batch_num))
    batch_num_val = math.ceil(num_val / batch_size)
    print('{}-{:d} testing batches per test.'.format(datetime.now(), batch_num_val))

    print(data_train.shape)

    #
    ######################################################################
    # Placeholders
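    # indices selects sample_num points per cloud; xforms and rotations hold
    # per-cloud 3x3 transform matrices used for data augmentation.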
    indices = tf.placeholder(tf.int32, shape=(None, None, 2), name="indices")
    xforms = tf.placeholder(tf.float32, shape=(None, 3, 3), name="xforms")
    rotations = tf.placeholder(tf.float32, shape=(None, 3, 3), name="rotations")
    jitter_range = tf.placeholder(tf.float32, shape=(1), name="jitter_range")
    global_step = tf.Variable(0, trainable=False, name='global_step')
    is_training = tf.placeholder(tf.bool, name='is_training')

    pts_fts = tf.placeholder(tf.float32, shape=(None, point_num, setting.data_dim), name='pts_fts')
    labels_seg = tf.placeholder(tf.int64, shape=(None, point_num), name='labels_seg')
    labels_weights = tf.placeholder(tf.float32, shape=(None, point_num), name='labels_weights')

    ######################################################################
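    # gather_nd with (batch, sample_num, 2) index pairs draws sample_num points
    # from each (point_num, data_dim) cloud -> (batch, sample_num, data_dim).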
    pts_fts_sampled = tf.gather_nd(pts_fts, indices=indices, name='pts_fts_sampled')
    features_augmented = None
    print(setting.data_dim)
    #####################################################################
    if setting.data_dim > 3:
        points_sampled, features_sampled = tf.split(pts_fts_sampled,
                                                    [3, setting.data_dim - 3],
                                                    axis=-1,
                                                    name='split_points_features')
        # print(points_sampled, '\n', features_sampled)
    else:
        # No extra channels beyond XYZ, so there are no features to split off.
        points_sampled = pts_fts_sampled
        features_sampled = None

    if setting.use_extra_features:
        features_augmented = features_sampled

    # print(xforms,'\n',jitter_range)
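    # pf.augment applies the per-cloud transform matrices and random jitter
    # to the sampled XYZ coordinates.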
    points_augmented = pf.augment(points_sampled, xforms, jitter_range)
    labels_sampled = tf.gather_nd(labels_seg, indices=indices, name='labels_sampled')
    labels_weights_sampled = tf.gather_nd(labels_weights, indices=indices, name='labels_weight_sampled')
    print(labels_sampled)
    print(labels_weights_sampled)
    # net = model.Net(points_augmented, features_augmented, is_training, setting)
```

Output in log file:

```
PID: 460
Namespace(batch_size=None, epochs=None, filelist='/pc/DublinCityAll/all/pointcnn/train_data_files.txt', filelist_val='/pc/DublinCityAll/all/pointcnn/val_data_files.txt', load_ckpt=None, log='log.txt', model='util_files', no_code_backup=False, no_timestamp_folder=False, save_folder='/opt/pointcnn/models/seg/', setting='semantic_x4_2048_fps')
2020-01-07 20:07:46.483095-Preparing datasets...
2020-01-07 20:07:47.273828-400/2104 training/validation samples.
2020-01-07 20:07:47.273855-1024 training batches.
2020-01-07 20:07:47.273868-22 testing batches per test.
(400, 8192, 5)
5
Tensor("labels_sampled:0", shape=(?, ?), dtype=int64)
Tensor("labels_weight_sampled:0", shape=(?, ?), dtype=float32)
```

This is all as expected.

Now, when I uncomment the last line, `net = model.Net(points_augmented, features_augmented, is_training, setting)`, the log file is blank. I do get a long TensorFlow warning message about TF2 in the setting.txt file, which may be causing the issue; I have no idea. There are no error messages, just warnings. I realize this is slightly different and I apologize for the long message. Has anyone come across this? Thank you!

Note: I am using TensorFlow 1.14 inside a Docker container.
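Edit: in case the wall of deprecation warnings is drowning out the log, here is a minimal sketch that quiets TF 1.14's warning output. Both settings are standard TF 1.x knobs; the script above already has the `TF_CPP_MIN_LOG_LEVEL` line, just commented out.

```python
import os

# Filter the C++ backend logs before TensorFlow is imported
# (1 = hide INFO, 2 = also hide WARNING).
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

# Hide the Python-side deprecation/TF2-migration warnings in TF 1.x.
tf.logging.set_verbosity(tf.logging.ERROR)
```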

tiger-bug commented 4 years ago

I went back and tested it on the original point cloud. It still looks like the `net` line blanks out the previously printed statements; I'm not sure why.

The issue now appears to be a core dump when I run the code. Would changing the batch size or the number of points help? I'll play around with it and see; just wanted to give an update.
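Update: `--batch_size` is already a flag in the script above, so shrinking the batch needs no code change. If the crash turns out to be GPU memory rather than host RAM, here is a minimal TF 1.x sketch that makes the session claim GPU memory on demand instead of pre-allocating nearly all of it up front (the session-creation code itself is hypothetical, since that part of the script isn't shown above):

```python
import tensorflow as tf

# TF 1.x: allocate GPU memory as needed rather than grabbing almost all of it
# at session start, which can avoid out-of-memory aborts on small GPUs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```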

tiger-bug commented 4 years ago

The log file does print (I was just being impatient; user error on my part), so I have edited the title of the issue.
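For the record: the "blank log" was just Python's output buffering. The script redirects `sys.stdout` to a file, and nothing reaches disk until the buffer fills or is flushed, so a slow graph build looks like a hang. A minimal sketch of a line-buffered redirect (assuming Python 3; `log.txt` stands in for the real path):

```python
import sys

# Line-buffered text mode (buffering=1): every print() is flushed to the file
# immediately, so a slow graph build no longer looks like a silent crash.
sys.stdout = open('log.txt', 'w', buffering=1)
print('this appears in log.txt right away')
```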

I believe I solved the issue from my previous comment: I don't think my computer had enough memory to handle the size of the data. When I tried it on a better machine, it worked. I will note that I had to change from TF 1.6 (used in the original documentation) to TF 1.7; I believe this is due to the system I was working on.

For anyone who comes across this issue: I am still going to verify on my laptop that it works with my own data, but for now I will close the issue.

Also, I apologize for editing the title so many times. I thought we had to manually strike out the original (I've never edited the title of an issue before).