Here is a description of techniques researchers might consider for parsing documents in bulk. This has been a mainstay of my own research. I used the first method in my earlier works, but in my later works I have been using method 3 given time constraints. Any of these options should be effective.

1) Sweat equity method: I used this method on all of my earlier works. Basically, I recommend a framing algorithm. Suppose you have 100,000 documents.

Step 1: Open the first document and manually identify where the section you want begins and ends. Look for very specific text that, if you see it again, can only mean the beginning or the end of that section. Complete step 1 by “coding” the begin/end phrases as the first “frame”. In your programming language, you would code your algorithm such that if it ever sees those phrases again, it will auto-parse.

Step 2: Now design your algorithm to “search the corpus” until it finds a document that does not match a frame you already coded. You will have to open that document by hand and code another frame into the system.

Step 3: Simply keep doing this. What you will find is that as you code up more frames, the process will get faster and faster. After coding 100 frames, for example, the algorithm will start auto-coding about 33% of the documents. After you code 200, it will auto-code 50%, and so on. This process of frame coding could take anywhere from a couple of weeks to a month, depending on how complex your corpus is. The good news is that there are economic reasons why new frames become rarer and auto-coding speeds up over time. For example, many firms use the same auditors and therefore the same templates.

2) Cash method 1: Hire people to do the above for you.

3) Cash method 2: Hire a company like metaHeuristica to auto-parse. I use them in many of my more recent papers, as honestly I don’t like to spend the time doing the sweat equity method any more. MetaHeuristica is absolutely stellar at auto-parsing.
The contact there is Christopher Ball (christopher.ball@metaheuristica.com). When he is not busy, he might charge lower rates (perhaps 2-5 thousand). I think he is busier than average these days, though. In any case, parsing is a task for which more and more commercial options should become available.

4) Academic method: Convince a computer science student to be a coauthor, or pay them to do the above.

I hope this is helpful!
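
For anyone who wants to try the sweat equity method (method 1), the framing loop might look roughly like the Python sketch below. This is only an illustration under my own assumptions, not the exact code behind any paper: the names Frame, extract_section, first_unmatched and coverage are hypothetical, the sample begin/end phrases are made up, and matching is plain case-sensitive substring search. Real frames usually need to be far more specific.

```python
from typing import Iterable, List, NamedTuple, Optional


class Frame(NamedTuple):
    """One hand-coded frame: the phrases that open and close the section."""
    begin: str
    end: str


def extract_section(text: str, frames: Iterable[Frame]) -> Optional[str]:
    """Return the text between the first frame whose begin and end phrases
    both appear (in that order), or None if no known frame matches."""
    for frame in frames:
        start = text.find(frame.begin)
        if start == -1:
            continue
        body_start = start + len(frame.begin)
        stop = text.find(frame.end, body_start)
        if stop == -1:
            continue
        return text[body_start:stop].strip()
    return None


def first_unmatched(docs: List[str], frames: List[Frame]) -> Optional[int]:
    """Step 2: search the corpus for the first document no frame can parse.
    That is the document to open by hand and code a new frame for."""
    for index, text in enumerate(docs):
        if extract_section(text, frames) is None:
            return index
    return None


def coverage(docs: List[str], frames: List[Frame]) -> float:
    """Share of documents that the current frames auto-parse."""
    if not docs:
        return 0.0
    matched = sum(extract_section(text, frames) is not None for text in docs)
    return matched / len(docs)


if __name__ == "__main__":
    # Tiny made-up corpus; a real one would be read from disk.
    corpus = [
        "Cover ITEM 1. BUSINESS We make widgets. ITEM 1A. RISK FACTORS ...",
        "Intro Business Overview We sell gadgets. Risk Factors ...",
    ]
    # Step 1: the first frame, coded after reading the first document.
    frames = [Frame("ITEM 1. BUSINESS", "ITEM 1A. RISK FACTORS")]

    # Steps 2 and 3: keep coding frames until nothing is left unmatched.
    while (index := first_unmatched(corpus, frames)) is not None:
        print(f"Document {index} needs a new frame; coverage so far:",
              coverage(corpus, frames))
        # In practice you would read the document and type the phrases in.
        frames.append(Frame("Business Overview", "Risk Factors"))
```

Tracking coverage after each new frame is one way to see the speed-up described in step 3, and to decide when the remaining documents are few enough to handle by hand.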