yoheikikuta / US-patent-analysis

This is a repository for the analysis of US patents.
Splitting Files with AWK #5

Open Pawlovicky opened 5 years ago

Pawlovicky commented 5 years ago

Challenges when splitting the concatenated XML file from https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2017

Pawlovicky commented 5 years ago

First attempt: split the file with awk at the us-patent-application closing tag:

awk -v x="F0.xml" "/<\/us-patent-application>/ "'{ print "" > x; x="F"++i".xml";next}{ print > x; }' $1

-> This left chunks of other XML documents in the output files before the opening tag (see e.g. https://mlstudygroup-phys.slack.com/files/U75940LTZ/FCM53B4F2/strange_xml.txt).

According to susumu, these XML fragments come from https://www.uspto.gov/patent/initiatives/complex-work-unit-pilot-program

Decision: skip these for the time being.

Pawlovicky commented 5 years ago

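The AWK code was revised as below. It now prints only the lines between the opening and closing us-patent-application tags, and close(x) was added to prevent the 'too many open files' error.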

awk -v x="F0.xml" '/<us-patent-application/, /<\/us-patent-application>/ { print > x } /<\/us-patent-application>/ { close(x); x = "F" (++i) ".xml" }' "$1"
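To make the behaviour of the range pattern easier to check, here is a minimal sketch that runs the same idea on a tiny made-up input. The sample file, the `<stray-record>` element and the `workdir` variable are assumptions for illustration only; they are not taken from the USPTO data.

```shell
#!/bin/sh
# Hypothetical sample: two applications with a stray record between them.
workdir=$(mktemp -d)
cat > "$workdir/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<us-patent-application lang="EN">
<doc>first</doc>
</us-patent-application>
<stray-record>not an application</stray-record>
<?xml version="1.0" encoding="UTF-8"?>
<us-patent-application lang="EN">
<doc>second</doc>
</us-patent-application>
EOF

# Rule 1: print only lines from an opening tag through its closing tag.
# Rule 2: after each closing tag, close the current output file and
#         switch to the next file name (F1.xml, F2.xml, ...).
(
  cd "$workdir" &&
  awk -v x="F0.xml" '
    /<us-patent-application/, /<\/us-patent-application>/ { print > x }
    /<\/us-patent-application>/ { close(x); x = "F" (++i) ".xml" }
  ' sample.xml
)

ls "$workdir"
```

On this sample the run should leave F0.xml and F1.xml next to sample.xml, with the stray record in neither. Note that the XML declaration line sits outside the range, so it is dropped from the split files as well.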

Pawlovicky commented 5 years ago

files from <= 2004 do not have the tag <us-patent-application but a tag