Open Pawlovicky opened 5 years ago
First attempt: splitting the file with awk at the us-patent-application closing tag:
awk -v x="F0.xml" "/<\/us-patent-application>/ "'{ print "" > x; x="F"++i".xml";next}{ print > x; }' $1
-> this left the problem that the output files still contained chunks of other XML files before the
According to susumu these XML files come from https://www.uspto.gov/patent/initiatives/complex-work-unit-pilot-program
decision made to skip these for the time being
The AWK code was revised as below. close(x) was added to prevent the 'too many open files' error.
awk -v x="F0.xml" "/<us-patent-application*/, /<\/us-patent-application>/"'{ print > x; } /<\/us-patent-application>/ {close(x); x="F"++i".xml"}' $1
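The revised one-liner can also be wrapped in a small shell function, which makes it easier to point the output at a separate directory. This is only a sketch of the same range-based split: the function name `split_applications` and the output-directory argument are hypothetical additions, not part of the original command.

```shell
# Sketch: split a concatenated XML file into one file per application.
# split_applications and its second argument (output directory) are
# hypothetical names introduced for illustration.
split_applications() {
  local input="$1" outdir="${2:-.}"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    BEGIN { n = 0; out = dir "/F" n ".xml" }
    # Print only the lines from the opening tag through the closing tag.
    /<us-patent-application/, /<\/us-patent-application>/ { print > out }
    # On the closing tag: close the finished file, then switch to the next name.
    /<\/us-patent-application>/ { close(out); n++; out = dir "/F" n ".xml" }
  ' "$input"
}
```

Calling `split_applications input.xml out` would write `out/F0.xml`, `out/F1.xml`, and so on. As with the one-liner, anything outside the range (for example the XML declaration, or documents with a different root tag) is dropped rather than written to a file.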
Files from 2004 and earlier do not have the tag <us-patent-application
but a tag
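Since the split relies on the <us-patent-application tag being present, one option is to check for it before splitting and skip (or handle separately) any file that lacks it. A minimal sketch, assuming a plain substring check is sufficient; the function name `has_application_tag` is hypothetical:

```shell
# Sketch: succeed (exit status 0) only if the file contains the
# <us-patent-application opening tag. has_application_tag is a
# hypothetical helper name introduced for illustration.
has_application_tag() {
  grep -q '<us-patent-application' "$1"
}
```

It could be used as a guard, e.g. `has_application_tag "$1" || echo "no us-patent-application tag in $1, skipping" >&2`, so that older files are reported instead of silently producing no output.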
Challenges when splitting the concatenated XML file from https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2017