Michael W. Lucas’s “Networking for System Administrators” is a great resource: https://mwl.io/nonfiction/networking#n4sa
PyMuPDF is excellent for extracting ‘structured’ text from a pdf page, though I believe ‘pulling out relevant information’ will still be a manual task unless the text you’re working with allows parsing into meaningful units.
That’s because ‘textual’ content in a pdf is nothing other than a bunch of instructions to draw glyphs inside a rect that represents a page. Utilities that come with mupdf or poppler arrange those glyphs (not always perfectly) into ‘blocks’, ‘lines’, and ‘words’ based solely on whitespace separation. The programmer who uses those utilities in an end-user-facing application then has to figure out how to create the illusion (so to speak) that the user is selecting/copying/searching for paragraphs, sentences, and so on, in proper reading order.
PyMuPDF comes with a rich collection of convenience functions to make all that less painful (dehyphenation, eliminating superfluous whitespace, etc.), but you’ll still need some further processing to pick out the humanly relevant info.
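To make that concrete, here’s a minimal sketch (filename hypothetical) of pulling block-level text out of a page, plus one of those built-in cleanups:

```python
import fitz  # PyMuPDF

doc = fitz.open("scan.pdf")  # hypothetical filename
page = doc[0]

# "blocks" yields (x0, y0, x1, y1, text, block_no, block_type) tuples,
# i.e. the whitespace-derived blocks mentioned above, with their bounding rects.
for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
    if block_type == 0:  # 0 = text block, 1 = image block
        print(f"block {block_no}: {text.strip()!r}")

# Plain text with one of the built-in cleanups: rejoin hyphenated line breaks.
clean = page.get_text("text", flags=fitz.TEXT_DEHYPHENATE)
```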
Python’s built-in regex capabilities can suffice for that parsing; but if not, you might want to look into NLTK’s tools, which apply sophisticated methods to tokenize words & sentences.
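For instance (sample text made up; recent NLTK versions may ask for the ‘punkt_tab’ resource instead of ‘punkt’):

```python
import re
import nltk

nltk.download("punkt", quiet=True)  # one-time; newer NLTK may want "punkt_tab"

clean = "Dr. Smith examined the data. It looked fine.\n\nA second paragraph."

# A regex often suffices: split on blank lines, collapse leftover whitespace.
paragraphs = [re.sub(r"\s+", " ", p).strip()
              for p in re.split(r"\n\s*\n", clean) if p.strip()]

# Where naive splitting fails (abbreviations like "Dr."), punkt does better:
sentences = nltk.sent_tokenize(paragraphs[0])  # keeps "Dr. Smith..." intact
words = nltk.word_tokenize(sentences[0])
```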
EDIT: I really should’ve mentioned some proper full-text search tools. Once you have a good plaintext representation of a pdf page, you might want to feed that representation into tools like the following to index it properly for relevant info:
https://lunr.readthedocs.io/en/latest/ – this is easy to use & set up, esp. in a Python project.
… it’s based on principles that are put to use in this full-scale, ‘industrial strength’ full-text search engine: https://solr.apache.org/ – it’s a bit of a pain to set up, but Python can interface with it through any HTTP client. Once you set up some kind of mapping between search tokens/keywords/tags, the plaintext page, & the actual pdf, you can get from a phrase search, for example, to a bunch of vector graphics (i.e. the pdf) relatively painlessly.
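Rough sketch of the lunr route, with made-up page contents – the “id” field here is exactly the kind of mapping I mean, encoding file + page number:

```python
from lunr import lunr

# One searchable document per pdf page; "id" carries the way back to the pdf.
pages = [
    {"id": "scan.pdf:0", "body": "plaintext of page 0 ..."},  # hypothetical contents
    {"id": "scan.pdf:1", "body": "plaintext of page 1 ..."},
]

idx = lunr(ref="id", fields=("body",), documents=pages)

# Each hit's "ref" leads straight back to a (file, page) pair.
for hit in idx.search("plaintext"):
    fname, page_no = hit["ref"].rsplit(":", 1)
    print(fname, int(page_no), hit["score"])
```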
Another vote for Tesseract – just to clarify the terminology, though: PDF is a fragile format best treated as read-only, so you really don’t want to edit a pdf; instead, make a new one using the same (or cleaned-up) bitmaps and a new OCR text layer.
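With pytesseract, for example, that “make a new one” step is a couple of lines (filenames hypothetical; `tesseract scan.png searchable pdf` on the CLI does the same):

```python
import pytesseract  # thin wrapper around the tesseract CLI

# Render a NEW pdf: the scanned bitmap as the image, plus an invisible
# OCR text layer underneath it. The source file is left untouched.
pdf_bytes = pytesseract.image_to_pdf_or_hocr("scan.png", extension="pdf")
with open("searchable.pdf", "wb") as f:
    f.write(pdf_bytes)
```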
Now, Tesseract is excellent at recognizing glyphs; but especially if the scanned image is a little fuzzy, the layout detection falters, and when it falters you get redundant line breaks & chunks of text in the wrong order – all of which gets incredibly annoying for searching & copying purposes. So if you can spare the time, and the text requires it, you may need to mark regions (paragraphs & titles mainly) on the bitmap image manually; there are a few frontends to Tesseract that help with a task like that – check out, e.g., https://github.com/manisandro/gImageReader. Inside single-paragraph blocks of text, Tesseract doesn’t get as easily confused, and the output comes out in the correct reading order & w/o redundant breaks, as in the sketch below.
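A sketch of that per-region approach, assuming you’ve collected the paragraph boxes by hand (coordinates & filenames made up):

```python
from PIL import Image
import pytesseract

img = Image.open("scan.png")  # hypothetical scan

# Paragraph/title boxes marked manually (or exported from a frontend like
# gImageReader), as (left, upper, right, lower) pixels, in reading order.
regions = [(100, 120, 900, 480), (100, 500, 900, 900)]

# --psm 6 = "assume a single uniform block of text": layout detection is
# bypassed inside each crop, so no scrambled order or spurious breaks.
paragraphs = [pytesseract.image_to_string(img.crop(box), config="--psm 6").strip()
              for box in regions]
text = "\n\n".join(paragraphs)
```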
Recently I became aware of ‘StarLite’ tablets – the prices are pretty steep, but the specs look really good, esp. wrt the screen.
Wouldn’t enabling the --system-site-packages flag during venv creation do exactly what the OP wants, provided that gunicorn is installed as a system package (e.g. with the distro’s package manager)? https://docs.python.org/3/library/venv.html

Sharing packages between venvs would be a dirty trick indeed; though sharing via --system-site-packages should be fine, AFAIK.
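E.g. (path hypothetical):

```python
# Shell equivalent:  python -m venv --system-site-packages /srv/app/venv
import venv

# The venv then sees distro-installed packages (gunicorn from apt/dnf/...),
# while anything pip-installed inside the venv still takes precedence.
venv.create("/srv/app/venv", system_site_packages=True, with_pip=True)
```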