ami summary
Aggregates files in the CProject hypertree into a toplevel tree. Useful for collecting all sections of a given type in one place.
ami summary --help
Usage: ami summary [OPTIONS]
Description
===========
Summarizes the CTree files into a single toplevel CProject directory tree.Used to be hardcoded , but now can be
controlled by glob
Options
=======
--dictionary=<dictionaryList>...
dictionaries to summarize. Probably OBSOLETE
Default: []
--gene[=<geneList>...]
genes to summarize. OBSOLETE
Default: human
--glob=<globList>[,<globList>...]
files to summarize (as glob)
Default: []
-h, --help Show this help message and exit.
--output-types=<outputTypes>...
output type/s. Not sure how useful this is. `table` creates a CSV table
Default: []
--species[=<speciesList>...]
species to summarize. OBSOLETE
Default: binomial
-V, --version Print version information and exit.
--word analyze word frequencies. Probably OBSOLETE.
A typical `ami` CTree is shown below (from `src/test/resources/ami/battery10`); it has many subtrees created by `ami` commands. It is "snipped" with `===` markers into separate chunks, all of which could be `summary`-ized. `ami` iterates over many `CTree`s, each of which differs in details of hierarchy, numbering and content. Note that the leading digits (as in `1_body`) keep the sections in order, as otherwise they may be scrambled by directory listing.
PMC3893646/
├── eupmc_result.json
===
├── fulltext.pdf
===
├── fulltext.xml
===
├── pdfimages
│ ├── image.2.2.304_553.69_273
│ │ ├── image.2.2.304_553
│ │ │ ├── image.2.2.304_553
│ │ │ │ └── hocr
===................................. OCR from image
│ │ │ │ ├── hocr.html
===................................. OCR from image (probably best)
│ │ │ │ └── hocr.svg
===................................. raw image
│ │ │ └── image.2.2.304_553.png
│ │ ├── images.html
│ │ ├── octree
│ │ │ ├── binary.png
===................................. colour channels for image
│ │ │ ├── channel.07a957.png
...
│ │ │ ├── channel.fcfcfc.png
│ │ │ ├── channels.html. ... browse
│ │ │ ├── histogram.svg
│ │ │ └── octree.png
│ │ ├── raw
│ │ │ └── hocr
│ │ ├── raw.annot.html
│ │ ├── raw.png
│ │ └── raw_o8.png
...
│ └── images.html
├── results
│ ├── search
│ │ ├── country
===................................. ami search results (entity in context)
│ │ │ └── results.xml
│ │ ├── elements
│ │ │ └── results.xml
│ │ └── funders
│ │ └── results.xml
===................................ word frequencies
│ └── word
│ └── frequencies
│ ├── results.html
│ └── results.xml
===................................ fulltext in HTML
├── scholarly.html
├── search.country.count.xml .... probably obsolete
├── search.country.snippets.xml
├── search.elements.count.xml
├── search.elements.snippets.xml
├── search.funders.count.xml
├── search.funders.snippets.xml
├── sections
===................................ bibliography (the directory names are controlled)
│ ├── 0_front
│ │ ├── 0_journal-meta
│ │ │ ├── 0_journal-id.xml
│ │ │ ├── 1_journal-id.xml
│ │ │ ├── 2_journal-title-group.xml
│ │ │ ├── 3_issn.xml
│ │ │ └── 4_publisher.xml
│ │ └── 1_article-meta
│ │ ├── 0_article-id.xml
│ │ ├── 10_elocation-id.xml
│ │ ├── 11_history.xml
│ │ ├── 12_permissions.xml
│ │ ├── 13_abstract.xml
│ │ ├── 1_article-id.xml
│ │ ├── 2_article-id.xml
│ │ ├── 3_article-categories.xml
│ │ ├── 4_title-group.xml
│ │ ├── 5_contrib-group.xml
│ │ ├── 6_author-notes.xml
│ │ ├── 7_pub-date.xml
│ │ ├── 8_pub-date.xml
│ │ └── 9_volume.xml
===................................ body (sections mainly as HTML) can be snipped anywhere
│ ├── 1_body
│ │ ├── 0_p.xml
│ │ ├── 1_p.xml
│ │ ├── 2_p.xml
===
│ │ ├── 3_results
│ │ │ ├── 0_title.xml
│ │ │ ├── 1_p.xml
│ │ │ ├── 2_p.xml
│ │ │ └── 3_p.xml
===
│ │ ├── 4_discussion
│ │ │ ├── 0_title.xml
│ │ │ ├── 1_p.xml
│ │ │ ├── 2_p.xml
│ │ │ ├── 3_p.xml
│ │ │ ├── 4_p.xml
│ │ │ ├── 5_p.xml
│ │ │ └── 6_p.xml
===................................ lower levels not consistently named
│ │ ├── 5_methods
│ │ │ ├── 0_title.xml
│ │ │ ├── 1_material_and_synthesis
│ │ │ │ ├── 0_title.xml
│ │ │ │ └── 1_p.xml
│ │ │ ├── 2_material_characterization
│ │ │ │ ├── 0_title.xml
│ │ │ │ └── 1_p.xml
│ │ │ └── 3_electrochemical_measureme
│ │ │ ├── 0_title.xml
│ │ │ └── 1_p.xml
===
│ │ ├── 6_author_contributions
│ │ │ ├── 0_title.xml
│ │ │ └── 1_p.xml
===
│ │ └── 7_supplementary_material
│ │ └── 0_title.xml
===................................ backmatter
│ ├── 2_back
===................................ acknowledgements
│ │ ├── 0_ack.xml
===................................ references
│ │ └── 1_ref-list
│ │ ├── 0_ref.xml
│ │ ├── 10_ref.xml
...
│ │ └── 9_ref.xml
===................................ original container of tables, figures, supplementary
│ ├── 3_floats-group
│ │ └── 6_supplementary-material.xml
===................................ figure captions
│ ├── figures
│ │ ├── figure_1.html
│ │ ├── figure_1.xml
...
│ │ ├── figure_6.html
│ │ ├── figure_6.xml
│ │ └── summary.html
│ └── supplementary
│ ├── summary.html
│ ├── supplementary_6.html
│ └── supplementary_6.xml
===................................ text and images from PDF
├── svg
│ ├── fulltext-page.0.svg
...
│ └── fulltext-page.6.svg
├── word.frequencies.count.xml ... OBSOLETE?
└── word.frequencies.snippets.xml
67 directories, 237 files
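The numeric prefixes record document order, but a plain alphabetical listing still interleaves them (in the tree above `10_elocation-id.xml` sorts before `1_article-id.xml`). The following is a minimal sketch of restoring the intended order by sorting on the leading number. The class and method names are hypothetical and not part of `ami`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: orders CTree child names by their numeric prefix. */
public class SectionOrderSketch {

    /** Leading integer of names such as "10_ref.xml"; Integer.MAX_VALUE when there is no "<digits>_" prefix. */
    public static int leadingNumber(String name) {
        int end = 0;
        while (end < name.length() && Character.isDigit(name.charAt(end))) {
            end++;
        }
        // Require 1 to 9 digits followed by '_' so that parseInt cannot overflow.
        if (end == 0 || end > 9 || end >= name.length() || name.charAt(end) != '_') {
            return Integer.MAX_VALUE;
        }
        return Integer.parseInt(name.substring(0, end));
    }

    /** Returns a copy sorted by numeric prefix, then alphabetically for ties and unprefixed names. */
    public static List<String> inDocumentOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Comparator<String> byNumber = Comparator.comparingInt(SectionOrderSketch::leadingNumber);
        sorted.sort(byNumber.thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> listing = List.of(
                "0_article-id.xml", "10_elocation-id.xml", "1_article-id.xml", "9_volume.xml");
        System.out.println(inDocumentOrder(listing));
    }
}
```

Names without a numeric prefix (for example `figures`) are placed after the numbered ones; that choice is an assumption made for this sketch.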
An example test, based on extracting `methods` subtrees:
@Test
public void summarizeMethods() {
    String root = "methods";
    String project = "summarizeProject/";
    File targetDir = new File("target/"+project);
    // work on a fresh copy so the test data is never modified
    CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
    String cmd = "-vvv"
            + " -p "+targetDir
            + " --output " + "/sections/body/"+root
            + " summary "
            + " --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml"
            ;
    AMI.execute(cmd);
    // expectedDir is not defined in this snippet; presumably it is built as in testSummarizeAbstracts below
    AbstractAMITest.compareDirectories(targetDir, expectedDir);
}
On the commandline this is:
ami -p <targetDir> --output /sections/body/methods \
    summary \
    --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml
This extracts from:
│ ├── 1_body
│ │ ├── 0_p.xml
...
===................................ lower levels not consistently named
│ │ ├── 5_methods
│ │ │ ├── 0_title.xml
│ │ │ ├── 1_material_and_synthesis
│ │ │ │ ├── 0_title.xml
│ │ │ │ └── 1_p.xml
│ │ │ ├── 2_material_characterization
│ │ │ │ ├── 0_title.xml
│ │ │ │ └── 1_p.xml
│ │ │ └── 3_electrochemical_measureme
│ │ │ ├── 0_title.xml
│ │ │ └── 1_p.xml
Generally we split at divs, so there is only `<title>` and `<p>` content. `<p>` may contain any normal HTML (`<ul>`, `<span>`, various style/format elements). Tables are normally in `<floats-group>`.

`methods` is generic (and may include "materials and methods" and other related concepts). But the lower titles are unlikely to be general and are probably not even consistent within the discipline (electrochemistry). So here we simply ignore them and look for any content under `methods`.
--glob **/PMC*/sections/*_body/*_methods/**/*_p.xml
The glob is relative to the CProject. It works by traversing the whole of the tree under `targetDir` and matching the files against the glob, which is a filter rather than a template. At each file or directory the globber asks whether the file matches. Because it doesn't know the context, we need the leading `**` ("match any"). Note: this cannot go outside the `CProject`, so it is "safe", but don't choose a CProject of "/" (any more than you would run `rm -rf /`, which I and many others have done :-)).
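The "filter rather than template" behaviour can be sketched with the standard `java.nio.file` glob matcher: walk every file under the project and keep those whose path matches. This is an illustration under assumptions, not `ami`'s actual globber; the class name, the `PMC1` directory and the `1_synthesis` subsection are made up for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only: traverse a project directory and filter files by glob. */
public class GlobFilterSketch {

    /** Walks everything under projectDir and keeps the regular files whose full path matches the glob. */
    public static List<Path> filter(Path projectDir, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(projectDir)) {
            return paths.filter(Files::isRegularFile)
                    .filter(matcher::matches)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny throwaway project (hypothetical names) and returns the matches relative to it. */
    public static List<String> demo() {
        try {
            Path project = Files.createTempDirectory("cproject");
            Path methods = project.resolve("PMC1/sections/1_body/5_methods/1_synthesis");
            Path results = project.resolve("PMC1/sections/1_body/3_results");
            Files.createDirectories(methods);
            Files.createDirectories(results);
            Files.createFile(methods.resolve("0_title.xml"));
            Files.createFile(methods.resolve("1_p.xml"));
            Files.createFile(results.resolve("1_p.xml"));
            return filter(project, "**/PMC*/sections/*_body/*_methods/**/*_p.xml").stream()
                    .map(p -> project.relativize(p).toString().replace('\\', '/'))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the paragraph under `5_methods` is kept; the title file and the paragraph under `3_results` are visited but rejected by the filter. Because the matcher sees the full path, the leading `**` is what absorbs the temporary directory prefix.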
The levels of the glob mean:
- `**` ancestors up to the disk root.
- `PMC*` all files somewhere under `cProject` that start with `PMC`. Avoids accessing other toplevel directories. This works with EPMC but is not optimal; we may change this later.
- `sections` a child directory of every `PMC*` that is exactly named `sections`.
- `*_body` any child directory of `sections` that ends with `_body` (there's normally only one).
- `*_methods` any child directory of any `*_body` directory whose name ends with `_methods`. Normally only one.
- `**` any number of directories, or none. There are no conventions for names or number of levels, so we have to do this.
- `*_p.xml` any leaf node with a name ending in `_p.xml`.
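To see which level accepts or rejects a given path, the glob can be tried against a few sample paths taken from the tree above. The sketch below uses the JDK's own `glob:` syntax as a stand-in; `ami`'s globber may differ in detail, the class name is hypothetical and `proj` stands for an arbitrary project directory.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Illustrative only: checks sample paths against the methods glob, level by level. */
public class GlobLevelsSketch {

    static final String GLOB = "**/PMC*/sections/*_body/*_methods/**/*_p.xml";

    /** True if the path (given with forward slashes) matches GLOB under java.nio glob rules. */
    public static boolean matches(String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + GLOB);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String[] samples = {
            // accepted: paragraph inside a methods subsection
            "proj/PMC3893646/sections/1_body/5_methods/1_material_and_synthesis/1_p.xml",
            // rejected by *_p.xml: a title, not a paragraph
            "proj/PMC3893646/sections/1_body/5_methods/1_material_and_synthesis/0_title.xml",
            // rejected by *_methods: a results section
            "proj/PMC3893646/sections/1_body/3_results/1_p.xml",
            // rejected by *_body: back matter
            "proj/PMC3893646/sections/2_back/1_ref-list/0_ref.xml",
            // rejected by the leading **: no ancestor in front of PMC*
            "PMC3893646/sections/1_body/5_methods/1_material_and_synthesis/1_p.xml",
        };
        for (String sample : samples) {
            System.out.println(matches(sample) + "  " + sample);
        }
    }
}
```

One difference worth knowing: in the JDK's glob syntax `/**/` needs at least one intermediate directory, so a `1_p.xml` sitting directly under `5_methods` would not match in this sketch, whereas the description above allows "none".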
This will retrieve just the leaf nodes and "flatten" the subtree. Because there may be many files named `1_p.xml`, we prepend another counter.
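The naming scheme can be sketched as follows. This only illustrates the idea of prefixing a running counter so that same-named leaves do not collide; the 1-based start, the class name and the sample paths are assumptions, and `ami`'s own implementation may number files differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: derives collision-free flat names for leaf files. */
public class FlattenNamesSketch {

    /** Prefixes each leaf's file name with a running counter (assumed to start at 1). */
    public static List<String> flattenedNames(List<Path> leaves) {
        List<String> names = new ArrayList<>();
        int counter = 1; // assumption: 1-based numbering
        for (Path leaf : leaves) {
            // getFileName() drops the directories, which is the "flattening" step
            names.add(counter + "_" + leaf.getFileName());
            counter++;
        }
        return names;
    }

    public static void main(String[] args) {
        List<Path> leaves = List.of(
                Paths.get("5_methods/1_material_and_synthesis/1_p.xml"),
                Paths.get("5_methods/2_material_characterization/1_p.xml"),
                Paths.get("4_methods/2_p.xml"));
        System.out.println(flattenedNames(leaves));
    }
}
```

The two files named `1_p.xml` end up as `1_1_p.xml` and `2_1_p.xml`, so neither overwrites the other. The output of the real run on battery10 is described next.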
The result is a directory containing 27 files:
target/summarizeProject/_summary/sections/body/methods/
├── 10_3_p.xml
├── 11_1_p.xml
├── 12_2_p.xml
...
├── 25_1_p.xml
├── 26_2_p.xml
├── 27_1_p.xml
...
└── 9_1_p.xml
(These have lost the knowledge of where they came from, but I'll deal with that soon). They contain XML tags which will probably need to be removed before doing textual analysis.
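One way to drop the tags before textual analysis is to parse each file and take the text content of the root element. The following is a minimal sketch using only the JDK's XML parser; it assumes each file is a small well-formed fragment such as a single `<p>` element, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative only: reduces an XML paragraph to plain text. */
public class PlainTextSketch {

    /** Returns the concatenated text of all elements, with runs of whitespace collapsed to one space. */
    public static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            // covers parser configuration, I/O and well-formedness errors
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<p>Li <italic>ion</italic>\n  cells</p>"));
    }
}
```

Inline markup such as `<italic>` is discarded while its text is kept, which is usually what a word-frequency analysis needs.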
Not Yet Implemented

@Test
public void summarizeResults() {
    String root = "methods";
    String project = "summarizeProject/";
    File targetDir = new File("target/"+project);
    CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
    String cmd = "-vvv"
            + " -p "+targetDir
            + " --output " + "/sections/body/"+root
            + " summary "
            + " --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml"
            ;
    AMI.execute(cmd);
    AbstractAMITest.compareDirectories(targetDir, expectedDir);
}
### Example with flattening
This allows you to collect all files of a given type, using glob. This can be used for sections, such as methods and abstract. The output is either a tree of files, or a CSV file with filenames and content.
Typical test:
/** extracts the flattened subtree of abstracts
 * and a summary.csv
 */
@Test
public void testSummarizeAbstracts() {
    String root = "abstract";
    String project = "battery10/";
    File expectedDir = new File(TEST_BATTERY10+"."+"expected", project);
    File targetDir = new File(TARGET_SUMMARY, project);
    CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
    String cmd = "-vvv"
            + " -p "+targetDir
            + " --output " + "/"+root
            + " summary "
            + " --flatten"
            + " --outtype tab"
            + " --glob **/PMC*/sections/*_front/*_article-meta/*_abstract.xml"
            ;
    AMI.execute(cmd);
    AbstractAMITest.compareDirectories(targetDir, expectedDir);
}
On commandline:
ami -vvv -p myProject --output myoutput summary --flatten --outtype tab \
--glob **/PMC*/sections/*_front/*_article-meta/*_abstract.xml