To the BaseX Community,
I am a 68-year-old #PayItForward cancer-survivor and independent #CitizenScientist doing applied research in #DigitalHumanities and #MachineLearning, and I need your help, please. And please forgive me for such a TL;DR post.
I am neither lazy nor a dilettante; I am simply under tight time pressure to get some development done on my Python-based metadata discovery and curation toolkit, which my fellow cancer-surviving wife, Timlynn, and I will be showcasing via our poster presentation accepted for next month's #DATeCH2019 conference in Brussels (http://datech.digitisation.eu/).
Our poster is entitled "#MAGAZINEgts and #dhSegment: Using a Metamodel Subgraph to Generate Synthetic Data of Under-Sampled Complex Document Structures for Machine-Learning" (ResearchGate preprint: https://is.gd/factminers_datech2019_poster). #MAGAZINEgts is the XML-based ground-truth storage format Timlynn and I are developing based on an ontological "stack" of #cidocCRM/FRBRoo/PRESSoo using a metamodel subgraph design pattern. The goal of our design is to support an integrated complex document structure and content depiction model for digital collections of print magazines and newspapers. (For more, see our #DATeCH2017 poster: https://is.gd/factminers_datech2017_poster)
We are evolving a reference implementation of the #MAGAZINEgts format for the collection of Softalk magazine at the Internet Archive. The collection is here: https://archive.org/details/softalkapple?&sort=date, and the MAGAZINEgts file (over 10 MB) is linked from the About page of the collection and is also provided here as a shortened link: https://is.gd/softalk_magazinegts_xml_file.
MY IMMEDIATE GOAL: Rather than continue with the awkward workflow of generating intermediary JSON metadata files, converting them to XML in batches, and copy-pasting the results into the appropriate positions in the master publication file, I want to integrate BaseX into the FactMiners Toolkit (fmtk) so it can perform direct, incremental updates of fine-grained #MAGAZINEgts metamodels, metadata, and their associated source-document-specific datasets. We would _really_ like to be showing this significant enhancement of our toolkit at the DATeCH conference (8-10 May).
MY CURRENT CONTEXT: I have the latest BaseX installed and working well, and I have done as much "fast track" learning as I can to come up to "toddler" speed on BaseX and its Python-based API. The Python API extension is installed and working within my PyCharm IDE, and I am on Windows 10.
MY CURRENT NEED: I have used the BaseX GUI to develop a sample XQuery for updating/adding a machine-learning data spec that curates the bounding boxes of advertisements on a magazine page. The query below is not parameterized for programmatic, dynamic execution; it is simply a hard-coded test of my evolving understanding of BaseX interactions. So the dataset name ("all_ads"), the issue-page filename ("softalk_v2n02pg002.png"), the various dimension numbers, etc., are explicit rather than variables in this sample query. When I run this query and then export the MAGgts master file, the update is there and looks great. Though my knowledge and skills with BaseX are small, they are growing, and I feel I have enough grip on things to forge ahead and at least get BaseX integrated for the #ML image-training-dataset feature that we will showcase at #DATeCH2019.
(BTW, I stripped the #MAGgts schema's XML namespace during BaseX database creation to make things easier while learning. I expect to simply restore it in the header after exporting and before uploading a new release of Softalk's reference implementation of this ground-truth format. Either that, or I will tweak the eventual Python implementation of the queries to include the namespace and just leave it intact when importing into the local BaseX database.)
HERE IS THE SAMPLE QUERY:
===
declare option db:writeback 'true';

declare variable $new_spec :=
  <ML_training_img_spec file_name="softalk_v2n02pg002.png">
    <ML_image_dim width="940" height="1280"/>
    <ML_label_bbox label="ad" status="predicted" left="500" top="680" width="444" height="580"/>
    <ML_label_bbox label="ad" status="actual" left="490" top="620" width="440" height="575"/>
  </ML_training_img_spec>;

update:output("Update successful."),
insert node $new_spec as last into
  doc("MAGgts")//Metadata//ML_maxpixel_datasets[@max_pixels = "1000000"]
    /ML_dataset[@name = "all_ads"]//ML_training_img_specs
===
MY REQUEST FOR ASSISTANCE: It would be _extraordinarily_ helpful, and Timlynn and I would be most grateful, if someone within the BaseX community who has familiarity with Python-based BaseX integration could provide a brief implementation -- similar to the examples supplied in the Client integration samples -- showing me how to take a BaseX GUI-developed query and convert it into a usable state in a Python program.
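To show where my own thinking has gotten, here is a minimal sketch of what I _imagine_ the Python side might look like, trimmed to the image-dimension child element for brevity. It assumes the BaseXClient.py module that ships in the BaseX distribution's client samples, a basexserver running on localhost:1984, and default admin credentials -- all placeholders to adapt. Corrections from experienced hands are most welcome:

```python
# Sketch only: the BaseXClient import, host/port, and credentials are
# assumptions to adapt; BaseXClient.py ships with the BaseX distribution.

# The GUI-developed query, parameterized: the hard-coded filename and
# dimensions become external variables that are bound from Python at run time.
QUERY = """
declare variable $file_name external;
declare variable $img_width external;
declare variable $img_height external;

let $new_spec :=
  <ML_training_img_spec file_name="{$file_name}">
    <ML_image_dim width="{$img_width}" height="{$img_height}"/>
  </ML_training_img_spec>
return (
  update:output("Update successful."),
  insert node $new_spec as last into
    doc("MAGgts")//Metadata//ML_maxpixel_datasets[@max_pixels = "1000000"]
      /ML_dataset[@name = "all_ads"]//ML_training_img_specs
)
"""

def add_training_img_spec(file_name, width, height,
                          host="localhost", port=1984,
                          user="admin", password="admin"):
    """Open the MAGgts database and append a new ML_training_img_spec.

    The import is deferred so this module still loads in environments
    where the BaseX client library is not installed.
    """
    from BaseXClient import BaseXClient
    session = BaseXClient.Session(host, port, user, password)
    try:
        session.execute("open MAGgts")      # database command, not XQuery
        query = session.query(QUERY)
        query.bind("$file_name", file_name)
        query.bind("$img_width", str(width))
        query.bind("$img_height", str(height))
        result = query.execute()            # runs the updating query
        query.close()
        return result                       # e.g. "Update successful."
    finally:
        session.close()
```

From there, a call like add_training_img_spec("softalk_v2n02pg002.png", 940, 1280) would reproduce (part of) the hard-coded insert above, and the bounding-box rows could presumably be parameterized the same way.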
BY WAY OF THANK YOU: If anyone can help us short-circuit my path to integrating BaseX into our #DATeCH2019 poster presentation, we will gladly cite you and your assistance in the acknowledgements of the poster and its 2-page companion handout.
Again, I am sorry for hitting folks with such a long and detailed request for newbie assistance.
In advance, thank you for any help you Good Folks may provide. I look forward to significantly improving the functionality of the FactMiners Toolkit by incorporating BaseX into its core platform.
Happy-Healthy Vibes from Colorado USA,
-: Jim Salmons :-