xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

Any experience with a gfx902 APU -> Ryzen 5850U #19

Closed delijati closed 2 years ago

delijati commented 2 years ago

Hi,

I have a gfx902 APU (Ryzen 5850U). I'm just asking around to see whether anyone has had any success getting this APU to run with ROCm.

Thanks

❯ rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 PRO 5850U with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 PRO 5850U with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1900                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    28567620(0x1b3e844) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    28567620(0x1b3e844) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    28567620(0x1b3e844) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx902                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Cezanne                            
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 5688(0x1638)                       
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   1792                               
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            4                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx902:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
xuhuisheng commented 2 years ago

The ROCm team has said they cannot support APUs for now. Some people have also reported issues when an APU and a discrete GPU are used at the same time. You can still give it a try.

delijati commented 2 years ago

I tried. I even built tensorflow-upstream, but I still get:

❯ ../../env/bin/python 02-Clustering.py
GENERATING EMBEDDING FOR: ATL_X
/home/foo/.cache/yay/hip-rocclr/src/HIP-rocm-4.3.1/rocclr/hip_code_object.cpp:486: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
[1]    24231 abort (core dumped)  ../../env/bin/python 02-Clustering.py
../../env/bin/python 02-Clustering.py  3,06s user 4,08s system 141% cpu 5,056 total
xuhuisheng commented 2 years ago

We can set AMD_LOG_LEVEL=6 to print more logs.
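
If it helps to capture that output for later inspection, here is a minimal Python sketch that runs a command with the variable set. The helper name `run_with_amd_logging` is made up for illustration, and the command at the bottom is only a stand-in child process, not the actual failing script.

```python
import os
import subprocess
import sys


def run_with_amd_logging(cmd, level=6):
    """Run cmd with AMD_LOG_LEVEL set; return (exit code, combined output)."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams so log lines are kept
        text=True,
    )
    return proc.returncode, proc.stdout


# Stand-in command (an assumption): a child Python that echoes the variable.
code, output = run_with_amd_logging(
    [sys.executable, "-c", "import os; print(os.environ['AMD_LOG_LEVEL'])"]
)
print(code, output.strip())
```

Replacing the stand-in list with the real command, for example the interpreter and script path used above, should return the same kind of log as a single string.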

delijati commented 2 years ago
$ AMD_LOG_LEVEL=6 ../../env/bin/python 02-Clustering.py
GENERATING EMBEDDING FOR: ATL_X
:3:rocdevice.cpp            :430 : 1885913346 us: Initializing HSA stack.
:3:comgrctx.cpp             :33  : 1885933593 us: Loading COMGR library.
:3:rocdevice.cpp            :196 : 1885936584 us: Numa selects cpu agent[0]=0x5568b74df830(fine=0x5568bb072be0,coarse=0x5568bad2bcf0, kern_arg=0x5568bb6f3f90) for gpu agent=0x7fa4db72ab34
:3:rocdevice.cpp            :1562: 1885937163 us: HMM support: 0, xnack: 0

:4:rocdevice.cpp            :1858: 1885937272 us: Allocate hsa host memory 0x7fa4e0002000, size 0x28
:4:rocdevice.cpp            :1858: 1885937696 us: Allocate hsa host memory 0x7fa460600000, size 0x101000
:4:rocdevice.cpp            :1858: 1885937997 us: Allocate hsa host memory 0x7fa460400000, size 0x101000
:4:runtime.cpp              :82  : 1885938102 us: init
:1:hip_code_object.cpp      :456 : 1885938529 us: hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp      :458 : 1885938540 us:   Devices:
:1:hip_code_object.cpp      :460 : 1885938542 us:     amdgcn-amd-amdhsa--gfx902:xnack- - [Not Found]
:1:hip_code_object.cpp      :465 : 1885938543 us:   Bundled Code Objects:
:1:hip_code_object.cpp      :482 : 1885938544 us:     host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp      :479 : 1885938546 us:     hipv4-amdgcn-amd-amdhsa--gfx1030 - [code object v4 is amdgcn-amd-amdhsa--gfx1030]
:1:hip_code_object.cpp      :479 : 1885938547 us:     hipv4-amdgcn-amd-amdhsa--gfx803 - [code object v4 is amdgcn-amd-amdhsa--gfx803]
:1:hip_code_object.cpp      :479 : 1885938549 us:     hipv4-amdgcn-amd-amdhsa--gfx900:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx900:xnack-]
:1:hip_code_object.cpp      :479 : 1885938550 us:     hipv4-amdgcn-amd-amdhsa--gfx906:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx906:xnack-]
:1:hip_code_object.cpp      :479 : 1885938552 us:     hipv4-amdgcn-amd-amdhsa--gfx908:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx908:xnack-]
:1:hip_code_object.cpp      :479 : 1885938553 us:     hipv4-amdgcn-amd-amdhsa--gfx90a:xnack+ - [code object v4 is amdgcn-amd-amdhsa--gfx90a:xnack+]
:1:hip_code_object.cpp      :479 : 1885938555 us:     hipv4-amdgcn-amd-amdhsa--gfx90a:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx90a:xnack-]
/home/foo/.cache/yay/hip-rocclr/src/HIP-rocm-4.3.1/rocclr/hip_code_object.cpp:486: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
[1]    17615 abort (core dumped)  AMD_LOG_LEVEL=6 ../../env/bin/python 02-Clustering.py
AMD_LOG_LEVEL=6 ../../env/bin/python 02-Clustering.py  2,52s user 3,90s system 141% cpu 4,544 total
delijati commented 2 years ago

At least I got HIP running:

❯ AMD_LOG_LEVEL=6 ./square.out
:3:rocdevice.cpp            :430 : 10625438911 us: Initializing HSA stack.
:3:comgrctx.cpp             :33  : 10625460831 us: Loading COMGR library.
:3:rocdevice.cpp            :196 : 10625465529 us: Numa selects cpu agent[0]=0x205e2a0(fine=0x20f1f80,coarse=0x20f7560, kern_arg=0x210f8d0) for gpu agent=0x7fcbce7c3b34
:3:rocdevice.cpp            :1562: 10625466470 us: HMM support: 0, xnack: 0

:4:rocdevice.cpp            :1858: 10625466635 us: Allocate hsa host memory 0x7fcbcea34000, size 0x28
:4:rocdevice.cpp            :1858: 10625467138 us: Allocate hsa host memory 0x7fcbcd400000, size 0x101000
:4:rocdevice.cpp            :1858: 10625467587 us: Allocate hsa host memory 0x7fcbcd200000, size 0x101000
:4:runtime.cpp              :82  : 10625467659 us: init
:3:hip_device.cpp           :239 : 10625467704 us: 30526: [7fcbcddfb540] hipGetDeviceProperties: Returned hipSuccess : 
info: running on device Cezanne
info: allocate host mem (  7.63 MB)
info: allocate device mem (  7.63 MB)
:3:hip_memory.cpp           :384 : 10625470790 us: 30526: [7fcbcddfb540] hipMalloc ( 0x7ffca2147320, 4000000 )
:4:rocdevice.cpp            :1993: 10625470946 us: Allocate hsa device memory 0x7fcbcc400000, size 0x3d0900
:3:rocdevice.cpp            :2032: 10625470952 us: device=0x211d4b0, freeMem_ = 0xffc2f700
:3:hip_memory.cpp           :386 : 10625470960 us: 30526: [7fcbcddfb540] hipMalloc: Returned hipSuccess : 0x7fcbcc400000: duration: 170 us
:3:hip_memory.cpp           :384 : 10625470964 us: 30526: [7fcbcddfb540] hipMalloc ( 0x7ffca2147318, 4000000 )
:4:rocdevice.cpp            :1993: 10625471018 us: Allocate hsa device memory 0x7fcbc0800000, size 0x3d0900
:3:rocdevice.cpp            :2032: 10625471026 us: device=0x211d4b0, freeMem_ = 0xff85ee00
:3:hip_memory.cpp           :386 : 10625471032 us: 30526: [7fcbcddfb540] hipMalloc: Returned hipSuccess : 0x7fcbc0800000: duration: 68 us
info: copy Host2Device
:3:hip_memory.cpp           :429 : 10625471065 us: 30526: [7fcbcddfb540] hipMemcpy ( 0x7fcbcc400000, 0x7fcbcce2f010, 4000000, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :2543: 10625471616 us: number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2618: 10625478601 us: created hardware queue 0x7fcbcd5f5000 with size 1024 with priority 1, cooperative: 0
:4:rocdevice.cpp            :1858: 10625478822 us: Allocate hsa host memory 0x7fcbcc980000, size 0x80000
:3:devprogram.cpp           :2466: 10625710710 us: Using Code Object V4.
:4:command.cpp              :303 : 10625712610 us: command is enqueued: 0x214adc0
:4:command.cpp              :262 : 10625712653 us: queue marker to command queue: 0x20f1b20
:4:command.cpp              :303 : 10625712656 us: command is enqueued: 0x205e500
:4:command.cpp              :222 : 10625712657 us: waiting for event 0x214adc0 to complete, current status 3
:4:commandqueue.cpp         :176 : 10625713048 us: command (CopyHostToDevice) is submitted: 0x214adc0
:4:rocvirtual.hpp           :200 : 10625713254 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46b00
:4:rocvirtual.hpp           :228 : 10625713263 us: [7fcbcd562640]!  WaitNext completion_signal=0x7fcbcea46a80
:4:rocblit.cpp              :670 : 10625713266 us: [7fcbcd562640]!  HSA Asycn Copy wait_event=0x0, completion_signal=0x7fcbcea46b00
:4:rocvirtual.hpp           :200 : 10625713701 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46b00
:4:rocvirtual.cpp           :449 : 10625713708 us: [7fcbcd562640]!  Host wait on completion_signal=0x7fcbcea46b00
:4:commandqueue.cpp         :176 : 10625714354 us: command (InternalMarker) is submitted: 0x205e500
:4:rocvirtual.hpp           :200 : 10625714368 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46b00
:4:command.cpp              :236 : 10625714371 us: event 0x214adc0 wait completed
:4:command.cpp              :152 : 10625714372 us: Command 0x214adc0 complete
:4:command.cpp              :152 : 10625714374 us: Command 0x205e500 complete
:3:hip_memory.cpp           :432 : 10625714379 us: 30526: [7fcbcddfb540] hipMemcpy: Returned hipSuccess : : duration: 243314 us
info: launch 'vector_square' kernel
:3:hip_platform.cpp         :202 : 10625714411 us: 30526: [7fcbcddfb540] __hipPushCallConfiguration ( {512,1,1}, {256,1,1}, 0, stream:<null> )
:3:hip_platform.cpp         :206 : 10625714419 us: 30526: [7fcbcddfb540] __hipPushCallConfiguration: Returned hipSuccess : 
:3:hip_platform.cpp         :213 : 10625714430 us: 30526: [7fcbcddfb540] __hipPopCallConfiguration ( {34542240,0,34538320}, {3458397079,32715,18}, 0x7ffca2147330, 0x7ffca2147328 )
:3:hip_platform.cpp         :222 : 10625714433 us: 30526: [7fcbcddfb540] __hipPopCallConfiguration: Returned hipSuccess : 
:3:hip_module.cpp           :489 : 10625714444 us: 30526: [7fcbcddfb540] hipLaunchKernel ( 0x401c10, {512,1,1}, {256,1,1}, 0x7ffca2147370, 0, stream:<null> )
:3:devprogram.cpp           :2466: 10625714623 us: Using Code Object V4.
:3:hip_module.cpp           :358 : 10625715521 us: 30526: [7fcbcddfb540] ihipModuleLaunchKernel ( 0x0x21577a0, 131072, 1, 1, 256, 1, 1, 0, stream:<null>, 0x7ffca2147370, char array:<null>, event:0, event:0, 0, 0 )
:4:command.cpp              :303 : 10625715595 us: command is enqueued: 0x215f780
:3:hip_platform.cpp         :638 : 10625715619 us: 30526: [7fcbcddfb540] ihipLaunchKernel: Returned hipSuccess : 
:3:hip_module.cpp           :491 : 10625715635 us: 30526: [7fcbcddfb540] hipLaunchKernel: Returned hipSuccess : 
info: copy Device2Host
:3:hip_memory.cpp           :429 : 10625715651 us: 30526: [7fcbcddfb540] hipMemcpy ( 0x7fcbcca5e010, 0x7fcbc0800000, 4000000, hipMemcpyDeviceToHost )
:4:command.cpp              :303 : 10625715657 us: command is enqueued: 0x214adc0
:4:command.cpp              :262 : 10625715660 us: queue marker to command queue: 0x20f1b20
:4:command.cpp              :303 : 10625715661 us: command is enqueued: 0x2157db0
:4:command.cpp              :222 : 10625715662 us: waiting for event 0x214adc0 to complete, current status 3
:4:commandqueue.cpp         :176 : 10625715663 us: command (KernelExecution) is submitted: 0x215f780
:3:rocvirtual.cpp           :603 : 10625715679 us: !    arg0:   = ptr:0x7fcbc0800000 obj:[0x7fcbc0800000-0x7fcbc0bd0900] threadId : 7fcbcd562640
:3:rocvirtual.cpp           :603 : 10625715685 us: !    arg1:   = ptr:0x7fcbcc400000 obj:[0x7fcbcc400000-0x7fcbcc7d0900] threadId : 7fcbcd562640
:3:rocvirtual.cpp           :2560: 10625715689 us: [7fcbcd562640]!  ShaderName : _Z13vector_squareIfEvPT_S1_m
:4:rocvirtual.cpp           :753 : 10625715723 us: [7fcbcd562640] HWq=0x7fcbcd5f5000, Dispatch Header = 0x502 (type=2, barrier=1, acquire=2, release=0), setup=3, grid=[131072, 1, 1], workgroup=[256, 1, 1], private_seg_size=0, group_seg_size=0, kernel_obj=0x7fcbc0408840, kernarg_address=0x7fcbcc980000, completion_signal=0x0
:4:commandqueue.cpp         :176 : 10625715741 us: command (CopyDeviceToHost) is submitted: 0x214adc0
:4:rocvirtual.hpp           :200 : 10625717359 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46a80
:4:rocvirtual.hpp           :228 : 10625717373 us: [7fcbcd562640]!  WaitNext completion_signal=0x7fcbcea46a00
:4:rocvirtual.cpp           :871 : 10625717384 us: [7fcbcd562640] HWq=0x7fcbcd5f5000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7fcbcea46a80
:4:rocvirtual.hpp           :200 : 10625717406 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46a00
:4:rocvirtual.hpp           :228 : 10625717414 us: [7fcbcd562640]!  WaitNext completion_signal=0x7fcbcea46980
:4:rocblit.cpp              :670 : 10625717420 us: [7fcbcd562640]!  HSA Asycn Copy wait_event=0x0, completion_signal=0x7fcbcea46a00
:4:rocvirtual.hpp           :200 : 10625718851 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46a00
:4:rocvirtual.cpp           :449 : 10625718897 us: [7fcbcd562640]!  Host wait on completion_signal=0x7fcbcea46a00
:4:commandqueue.cpp         :176 : 10625719926 us: command (InternalMarker) is submitted: 0x2157db0
:4:rocvirtual.hpp           :200 : 10625719960 us: [7fcbcd562640]!  WaitCurret completion_signal=0x7fcbcea46a00
:4:command.cpp              :152 : 10625719969 us: Command 0x215f780 complete
:4:command.cpp              :152 : 10625719978 us: Command 0x214adc0 complete
:4:command.cpp              :152 : 10625719981 us: Command 0x2157db0 complete
:4:command.cpp              :236 : 10625719984 us: event 0x214adc0 wait completed
:3:hip_memory.cpp           :432 : 10625720011 us: 30526: [7fcbcddfb540] hipMemcpy: Returned hipSuccess : : duration: 4360 us
info: check result
PASSED!
xuhuisheng commented 2 years ago

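So the log says that one of the rocm-libs isn't built for gfx902: the device reports gfx902:xnack-, but the bundled code objects only cover gfx803, gfx900, gfx906, gfx908, gfx90a, and gfx1030. The square sample runs properly, which I guess means you have only the APU and no discrete GPU. The next step is to test the rocm-libs one by one to find which component needs to be rebuilt for gfx902.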
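
To make that comparison less manual when checking several libraries, here is a rough Python sketch that pulls the two lists out of a captured log. The function name `summarize` and both regular expressions are assumptions based only on the line format seen in the AMD_LOG_LEVEL=6 output earlier in this thread; other ROCm versions may print these lines differently.

```python
import re

# Device lines look like:
#   "...:     amdgcn-amd-amdhsa--gfx902:xnack- - [Not Found]"
NOT_FOUND = re.compile(r"(amdgcn-amd-amdhsa--\S+) - \[Not Found\]")
# Bundled entries look like:
#   "...:     hipv4-amdgcn-amd-amdhsa--gfx900:xnack- - [code object v4 is ...]"
BUNDLED = re.compile(r"hipv4-(amdgcn-amd-amdhsa--\S+) - \[code object")


def summarize(log_text):
    """Return (device targets with no code object, targets the binary bundles)."""
    missing = NOT_FOUND.findall(log_text)
    bundled = BUNDLED.findall(log_text)
    return missing, bundled


# Three lines copied from the log posted above.
SAMPLE_LOG = """\
:1:hip_code_object.cpp      :460 : 1885938542 us:     amdgcn-amd-amdhsa--gfx902:xnack- - [Not Found]
:1:hip_code_object.cpp      :479 : 1885938546 us:     hipv4-amdgcn-amd-amdhsa--gfx1030 - [code object v4 is amdgcn-amd-amdhsa--gfx1030]
:1:hip_code_object.cpp      :479 : 1885938549 us:     hipv4-amdgcn-amd-amdhsa--gfx900:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx900:xnack-]
"""

missing, bundled = summarize(SAMPLE_LOG)
print("missing:", missing)
print("bundled:", bundled)
```

Running a small test program for each library and feeding its log through `summarize` would show which component reports the device target as missing, and is therefore a candidate for a rebuild.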

xuhuisheng commented 2 years ago

It has been 2 months since the last post, so I will close this issue. Please reopen it if there are any updates.

jf-horton commented 1 year ago

@delijati I've been trying to get HIP running on a gfx902, but have had no luck. What version of ROCm did you use? Did you have to build from source?

mzimmerm commented 5 months ago

@delijati In the last 2 weeks I have spent significant time trying to get PyTorch on ROCm running on gfx902. I experimented with different Linux distributions (mostly Ubuntu) and different ROCm / AMDGPU versions, with no luck. The few combinations I got working generally end with this error: "HIP error: shared object initialization failed". I am now declaring it a failure and an impossibility, and for ML/DL testing I am getting a video card that does not use ROCm, since official ROCm support covers about 6 cards today and no APUs.