Not working on RHEL7.3 ppc64le

sandipmgiri commented 7 years ago

I installed it with the command "pip install py-cpuinfo". However, while running the Bazel tests in TensorFlow I get the following error:
Exception: py-cpuinfo currently only works on X86 and some ARM CPUs.
After that I installed it from source, and the TF tests now fail with a different error: "KeyError: 'l2_cache_size'". PFA for more details.

workhorsy commented 7 years ago

Can you give me the file system_info.txt that is generated by this script?: https://raw.githubusercontent.com/workhorsy/py-cpuinfo/master/tools/get_system_info.py

sandipmgiri commented 7 years ago

PFA. system_info.txt

Thanks!

workhorsy commented 7 years ago

Thanks. I'll see if I can fix it.

sandipmgiri commented 7 years ago

Hi @workhorsy, any update on this?

workhorsy commented 7 years ago

I've got it partially working on Debian 8 PPC64le (Everything but CPU flags). I don't have access to RHEL right now. I'll see if I can get CentOS PPC64le running in KVM.

sandipmgiri commented 7 years ago

Same issue on Ubuntu-16.04 (ppc64le) as well, so you can try on Ubuntu.

sandipmgiri commented 7 years ago

Hi @workhorsy, check https://developer.ibm.com/linuxonpower/cloud-resources/ and see if you can get access to a RHEL ppc VM.

workhorsy commented 7 years ago

Thanks @sandipmgiri , I'll look into it.

workhorsy commented 7 years ago

It should be working now.

sandipmgiri commented 7 years ago

Hi @workhorsy, the "Exception: py-cpuinfo currently only works on X86 and some ARM CPUs" error is gone now, but I'm getting another error:

$ bazel test -c opt //tensorflow/tools/test:rnn_op_benchmark
-----------------------------------------------------------------------------
..
----------------------------------------------------------------------

Ran 2 tests in 0.008s

OK
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/run_and_gather_logs.py", line 99, in <module>
    app.run()
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/run_and_gather_logs.py", line 78, in main
    test_args)
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/run_and_gather_logs_lib.py", line 145, in run_and_gather_logs
    log_files=log_files), mangled_test_name)
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/run_and_gather_logs_lib.py", line 76, in process_test_logs
    system_info_lib.gather_machine_configuration())
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/system_info_lib.py", line 44, in gather_machine_configuration
    config.cpu_info.CopyFrom(gather_cpu_info())
  File "/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/execroot/tensorflow/bazel-out/local-opt/bin/tensorflow/tools/test/rnn_op_benchmark.runfiles/org_tensorflow/tensorflow/tools/test/system_info_lib.py", line 98, in gather_cpu_info
    l2_cache_size = re.match(r'(\d+)', str(info['l2_cache_size']))
KeyError: 'l2_cache_size'

I would like to know your comments/suggestions.

workhorsy commented 7 years ago

The returned info dict will only contain keys for data that was found. So it looks like there was no 'l2_cache_size' data found. It looks like this is a bug in the tensorflow script.

https://github.com/tensorflow/tensorflow/blob/34df7ee6f5a8f931d2433dc7e6e739bc880a35ea/tensorflow/tools/test/system_info_lib.py#L94

It should not just assume that the keys will be there:

  # Gather the rest
  info = cpuinfo.get_cpu_info()
  cpu_info.cpu_info = info['brand']
  cpu_info.num_cores = info['count']
  cpu_info.mhz_per_cpu = info['hz_advertised_raw'][0] / 1.0e6
  l2_cache_size = re.match(r'(\d+)', str(info['l2_cache_size']))

It could be changed to something like:

  # Gather the rest
  info = cpuinfo.get_cpu_info()
  cpu_info.cpu_info = info.get('brand', None)
  cpu_info.num_cores = info.get('count', None)
  cpu_info.mhz_per_cpu = info.get('hz_advertised_raw', (0, 0))[0] / 1.0e6
  if 'l2_cache_size' in info:
      l2_cache_size = re.match(r'(\d+)', str(info['l2_cache_size']))
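A minimal, self-contained sketch of the same idea, using a plain dict in place of the real `cpuinfo.get_cpu_info()` result so it runs without py-cpuinfo installed. The function name `gather_basic_fields` and the returned tuple are hypothetical, chosen only for illustration; they are not part of TensorFlow or py-cpuinfo.

```python
def gather_basic_fields(info):
    """Read fields from a py-cpuinfo-style dict without assuming any key exists.

    `info` stands in for the dict returned by cpuinfo.get_cpu_info().
    """
    brand = info.get('brand')        # None when the key is absent
    num_cores = info.get('count')    # None when the key is absent
    # Fall back to (0, 0) so the index and the division still work.
    mhz_per_cpu = info.get('hz_advertised_raw', (0, 0))[0] / 1.0e6
    # Only touch the cache size if it was actually reported.
    has_l2 = 'l2_cache_size' in info
    return brand, num_cores, mhz_per_cpu, has_l2


# A dict with no 'l2_cache_size' key, as in the failing case.
sample = {'brand': 'example CPU', 'count': 4, 'hz_advertised_raw': (2000000000, 0)}
print(gather_basic_fields(sample))
print(gather_basic_fields({}))
```

With direct indexing (`info['l2_cache_size']`) the first call would raise the same KeyError shown in the traceback above; with `.get()` and the `in` check, both calls return normally.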

workhorsy commented 7 years ago

Also note, @sandipmgiri, that this may be a bug in py-cpuinfo: it may be failing to gather the cache size.

sandipmgiri commented 7 years ago

Hi @workhorsy,

import cpuinfo
print(cpuinfo.get_cpu_info())

I ran this sample code on ppc64le and x86 to check the get_cpu_info() output:

ppc64le output:

{'count': 160, 'hz_advertised': '4.0230 GHz', 'bits': 64, 'brand': 'POWER8NVL (raw), altivec supported', 'cpuinfo_version': (3, 2, 0), 'flags': ['dabr', 'dabrx', 'dsisr', 'fpu', 'lp', 'mmu', 'pp', 'rislb', 'run', 'slb', 'sprg3', 'ugr_in_dscr'], 'raw_arch_string': 'ppc64le', 'hz_actual_raw': (4023000000, 0), 'hz_actual': '4.0230 GHz', 'arch': 'PPC_64', 'hz_advertised_raw': (4023000000, 0)}

x86 output:

{'count': 1, 'model': 42, 'hz_advertised': '2.2947 GHz', 'family': 6, 'bits': 64, 'brand': 'Intel Xeon E312xx (Sandy Bridge)', 'vendor_id': 'GenuineIntel', 'cpuinfo_version': (3, 2, 0), 'flags': ['abm', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'constant_tsc', 'cx16', 'cx8', 'de', 'eagerfpu', 'ept', 'erms', 'f16c', 'fma', 'fpu', 'fsgsbase', 'fxsr', 'hypervisor', 'invpcid', 'lahf_lm', 'lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'nopl', 'nx', 'pae', 'pat', 'pcid', 'pclmulqdq', 'pdpe1gb', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrand', 'rdtscp', 'rep_good', 'sep', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'syscall', 'tsc', 'tsc_deadline_timer', 'vme', 'vmx', 'vnmi', 'x2apic', 'xsave', 'xsaveopt'], 'raw_arch_string': 'x86_64', 'l2_cache_size': '4096 KB', 'stepping': 1, 'hz_actual_raw': (2294686000, 0), 'hz_actual': '2.2947 GHz', 'arch': 'X86_64', 'hz_advertised_raw': (2294686000, 0)}

It seems that l2_cache_size is not returned by get_cpu_info() on ppc64le. I'm not sure, but this looks like a bug in the py-cpuinfo module on ppc64le (it fails to gather l2_cache_size).
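The difference between the two outputs can be checked mechanically. The sketch below copies only the key names from the two dicts printed above (the values are not needed) and uses a set difference to list what the x86 result has that the ppc64le result lacks. The variable names are illustrative.

```python
# Keys copied from the ppc64le output above.
ppc64le_keys = {
    'count', 'hz_advertised', 'bits', 'brand', 'cpuinfo_version', 'flags',
    'raw_arch_string', 'hz_actual_raw', 'hz_actual', 'arch', 'hz_advertised_raw',
}

# Keys copied from the x86 output above.
x86_keys = {
    'count', 'model', 'hz_advertised', 'family', 'bits', 'brand', 'vendor_id',
    'cpuinfo_version', 'flags', 'raw_arch_string', 'l2_cache_size', 'stepping',
    'hz_actual_raw', 'hz_actual', 'arch', 'hz_advertised_raw',
}

# Keys reported on x86 but not on ppc64le.
missing_on_ppc64le = sorted(x86_keys - ppc64le_keys)
print(missing_on_ppc64le)
# ['family', 'l2_cache_size', 'model', 'stepping', 'vendor_id']
```

So l2_cache_size is one of five keys absent on ppc64le in these two runs, and any caller that indexes one of them directly would hit a KeyError there.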

sandipmgiri commented 7 years ago

Hi @workhorsy ,

I discussed this issue with the TensorFlow community.

As you said, we also found that l2_cache_size is not returned by get_cpu_info() on the ppc64le platform.
To resolve this, we made a change (similar to yours) in the TensorFlow script: https://github.com/tensorflow/tensorflow/blob/v1.0.1/tensorflow/tools/test/system_info_lib.py#L93

  # Gather the rest
  info = cpuinfo.get_cpu_info()
  cpu_info.cpu_info = info['brand']
  cpu_info.num_cores = info['count']
  cpu_info.mhz_per_cpu = info['hz_advertised_raw'][0] / 1.0e6
  l2_cache_size = re.match(r'(\d+)', str(info['l2_cache_size']))

Changed to :

  # Gather the rest
  info = cpuinfo.get_cpu_info()
  cpu_info.cpu_info = info['brand']
  cpu_info.num_cores = info['count']
  cpu_info.mhz_per_cpu = info['hz_advertised_raw'][0] / 1.0e6
  l2_cache_size = re.match(r'(\d+)', str(info.get('l2_cache_size', '')))
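A short sketch of how the changed line behaves, assuming a plain dict in place of the real get_cpu_info() result. When the key is missing, `info.get('l2_cache_size', '')` yields an empty string, `re.match` finds no digits and returns None, so whatever uses the match afterwards has to check for None. The helper name `l2_cache_size_number` is hypothetical and not part of TensorFlow.

```python
import re


def l2_cache_size_number(info):
    """Return the leading integer of info['l2_cache_size'], or None if unavailable."""
    # Same expression as the changed line: a missing key becomes '' and simply fails to match.
    match = re.match(r'(\d+)', str(info.get('l2_cache_size', '')))
    return int(match.group(1)) if match else None


print(l2_cache_size_number({'l2_cache_size': '4096 KB'}))  # 4096 (value format from the x86 output)
print(l2_cache_size_number({}))                            # None (the ppc64le case)
```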

Relevant TensorFlow discussion: https://github.com/tensorflow/tensorflow/issues/10371
Relevant code change: https://github.com/tensorflow/tensorflow/commit/69ba4d3d49bd5775131ae7f00830a41f478dbbf5

This issue is now resolved. Thanks a lot for your suggestions and pointers!

workhorsy commented 7 years ago

Cool @sandipmgiri . Thanks for the info. Hopefully ppc support in py-cpuinfo can improve to have better cache info.

workhorsy / py-cpuinfo

Not working on RHEL7.3 ppc64le #61