xbrlus / wip

Surety Work in Process Taxonomy
https://xbrl.us/2021-surety-wip/

Performance with large XML files #5

Open keyurpd opened 4 years ago

keyurpd commented 4 years ago

I was trying to parse large XML files with thousands of contracts, and, unsurprisingly, it is very slow.

I have a feeling that the structure/XSD allows the details of a single contract to appear, in theory, anywhere within the file. On top of that, there are two context strings for a single contract/axis, so you have to do a double string comparison against attributes across the entire file.

1000 contracts = 1000 axes = 2000 contexts = 15000 elements (assuming 15 columns/contract) = 2000 x 15000 = 30,000,000 string comparisons!

Is there a way to speed up the processing?

keyurpd commented 4 years ago

I would appreciate a response, as I believe the structure defined by XBRL plays a big part in compounding the problem.

Stay safe!

davidtauriello commented 4 years ago

Hi @keyurpd - apologies for the delay in responding. It seems like you can reduce the time spent processing by first indexing the contexts in the filing. By leveraging this attribute, you'll reduce the number of string comparisons significantly.
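As a rough illustration of the indexing idea, here is a minimal Python sketch that walks the instance once and groups facts by the context they reference. It assumes the attribute in question is the standard XBRL `contextRef` on each fact; the `wip` namespace URI, the element names, and the empty context elements are hypothetical stand-ins rather than the real taxonomy.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in instance. The "wip" namespace URI and the element names are
# hypothetical, and the contexts are left empty to keep the sample short.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                        xmlns:wip="http://example.com/wip">
  <xbrli:context id="c1_instant"/>
  <xbrli:context id="c1_duration"/>
  <wip:ContractAmount contextRef="c1_instant">1000</wip:ContractAmount>
  <wip:BilledToDate contextRef="c1_instant">400</wip:BilledToDate>
  <wip:EarnedRevenue contextRef="c1_duration">250</wip:EarnedRevenue>
</xbrli:xbrl>"""


def index_facts_by_context(root):
    """Single pass over the top-level elements: contextRef -> [(tag, text)]."""
    index = defaultdict(list)
    for elem in root:
        ref = elem.get("contextRef")
        if ref is not None:  # only facts carry contextRef; contexts do not
            index[ref].append((elem.tag, elem.text))
    return index


root = ET.fromstring(SAMPLE)
facts_by_context = index_facts_by_context(root)

# Each lookup is now a dictionary access instead of a scan of the whole file.
instant_facts = facts_by_context["c1_instant"]
```

With an index like this, the cost is roughly one pass over the 15000 elements plus one dictionary lookup per context, rather than comparing every context against every element. In C#, a `Dictionary` or LINQ's `ToLookup` keyed on the same attribute should play the same role.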

It would be great to learn more about how you're using the WIP taxonomy - email david.tauriello @ xbrl.us with a few dates and times when you have 30 min to help my team understand the work you're doing and its goals.

keyurpd commented 4 years ago

Thanks @davidtauriello

This is how we are doing it currently. Only the last step, marked with an arrow, is the real bottleneck.

  1. Technology: .NET Core, C#
  2. Basic design: Read the file into memory using XElement. Use LINQ to create POCO objects (an array of Contracts), following the steps below.
  3. Load file into an XElement - everything now is in-memory.
  4. Extract document metadata; date, year, entity name.
  5. Extract all the contexts, grouped by axis.
  6. Iterate through all the axes (rows); get both contexts for each row.
  7. Find all financial elements (wip/gaap) for a given pair of contexts. Repeat for all pairs. <-- This is the slowest part of the system, far slower than any of the above.

Overall, the above approach is acceptable when the contracts number in the few hundreds. But when the count reaches the thousands, as it sometimes does, it becomes painfully slow. I believe the free-form structure plays a very big part in the slowness.
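Steps 5 to 7 above could be restructured so that the file is scanned once and each contract's row is then assembled with dictionary lookups. The Python sketch below is only one possible shape for that: it assumes each contract is identified by a typed dimension member inside its contexts, and the `wip` namespace URI, `wip:ContractAxis`, `wip:ContractId`, and the fact names are hypothetical placeholders, not names taken from the WIP taxonomy.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"xbrldi": "http://xbrl.org/2006/xbrldi"}
CONTEXT_TAG = "{http://www.xbrl.org/2003/instance}context"

# Hypothetical sample: two contracts, two contexts each, facts in any order.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:wip="http://example.com/wip">
  <wip:ContractAmount contextRef="b_i">500</wip:ContractAmount>
  <xbrli:context id="a_i"><xbrli:entity><xbrli:segment>
    <xbrldi:typedMember dimension="wip:ContractAxis"><wip:ContractId>A-1</wip:ContractId></xbrldi:typedMember>
  </xbrli:segment></xbrli:entity></xbrli:context>
  <xbrli:context id="a_d"><xbrli:entity><xbrli:segment>
    <xbrldi:typedMember dimension="wip:ContractAxis"><wip:ContractId>A-1</wip:ContractId></xbrldi:typedMember>
  </xbrli:segment></xbrli:entity></xbrli:context>
  <xbrli:context id="b_i"><xbrli:entity><xbrli:segment>
    <xbrldi:typedMember dimension="wip:ContractAxis"><wip:ContractId>B-2</wip:ContractId></xbrldi:typedMember>
  </xbrli:segment></xbrli:entity></xbrli:context>
  <xbrli:context id="b_d"><xbrli:entity><xbrli:segment>
    <xbrldi:typedMember dimension="wip:ContractAxis"><wip:ContractId>B-2</wip:ContractId></xbrldi:typedMember>
  </xbrli:segment></xbrli:entity></xbrli:context>
  <wip:ContractAmount contextRef="a_i">1000</wip:ContractAmount>
  <wip:EarnedRevenue contextRef="a_d">250</wip:EarnedRevenue>
  <wip:EarnedRevenue contextRef="b_d">75</wip:EarnedRevenue>
</xbrli:xbrl>"""


def build_rows(root):
    """Return {contract id: {fact local name: text}} using one document pass."""
    facts_by_context = defaultdict(list)      # context id -> [(name, text)]
    contexts_by_contract = defaultdict(list)  # contract id -> [context id]

    for elem in root:
        if elem.tag == CONTEXT_TAG:
            member = elem.find(".//xbrldi:typedMember", NS)
            if member is not None:
                contract_id = "".join(member.itertext()).strip()
                contexts_by_contract[contract_id].append(elem.get("id"))
        elif elem.get("contextRef") is not None:
            local_name = elem.tag.rsplit("}", 1)[-1]  # drop the namespace part
            facts_by_context[elem.get("contextRef")].append((local_name, elem.text))

    # Join step: dictionary lookups only, no rescanning of the document.
    rows = {}
    for contract_id, context_ids in contexts_by_contract.items():
        row = {}
        for context_id in context_ids:
            row.update(facts_by_context.get(context_id, []))
        rows[contract_id] = row
    return rows


rows = build_rows(ET.fromstring(SAMPLE))
```

Because the join happens after the pass, it does not matter where in the file a contract's contexts and facts appear. If two facts with the same name landed in one row, the later one would overwrite the earlier one in this sketch, so a real implementation would need to decide how to handle that.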

keyurpd commented 4 years ago

One alternative design could be hierarchical groups...

<row>
  <column/>
  <column/>
</row>
<row>
  <column/>
  <column/>
</row>