Add support for converting from MARCXML

This adds support for converting MARCXML files. It uses pymarc's `pymarc.marcxml.parse_xml_to_array`, so every record in the XML file is read into memory before any of them is converted. Fixes #9.
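
The memory cost noted above comes from building the full record list before iterating. As a rough illustration of the streaming alternative, and not what this commit does, the sketch below walks a MARCXML document one `<record>` at a time using only the standard library. The helper name `iter_marcxml_records` and the sample data are hypothetical, it assumes records are in the standard MARC21 slim namespace, and it yields raw XML elements rather than `pymarc.Record` objects.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

# Assumption: records use the standard MARC21 "slim" namespace.
MARC_NS = "http://www.loc.gov/MARC21/slim"
RECORD_TAG = f"{{{MARC_NS}}}record"


def iter_marcxml_records(xml_input: BinaryIO) -> Iterator[ET.Element]:
    """Yield each MARCXML <record> element as soon as it is fully parsed.

    Hypothetical helper for illustration; not part of marctable or pymarc.
    """
    for _event, elem in ET.iterparse(xml_input, events=("end",)):
        if elem.tag == RECORD_TAG:
            yield elem
            # Release the record's children once the consumer moves on.
            elem.clear()


# Made-up two-record collection used only to exercise the helper.
sample = io.BytesIO(
    b'<collection xmlns="http://www.loc.gov/MARC21/slim">'
    b"<record><leader>00000nam a2200000 a 4500</leader>"
    b'<controlfield tag="001">a1</controlfield></record>'
    b"<record><leader>00000nam a2200000 a 4500</leader>"
    b'<controlfield tag="001">a2</controlfield></record>'
    b"</collection>"
)

ids = [
    rec.find(f"{{{MARC_NS}}}controlfield[@tag='001']").text
    for rec in iter_marcxml_records(sample)
]
print(ids)  # ['a1', 'a2']
```

Turning each element into a `pymarc.Record` would still need a conversion step, so this only shows the shape of a record-at-a-time loop that could replace the in-memory list.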
sul-dlss-labs · Jan 16, 2024 · 2296b10
1 parent e10a0af
Showing 4 changed files with 786,431 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 [![Build Status](https://github.com/edsu/marctable/actions/workflows/test.yml/badge.svg)](https://github.com/edsu/marctable/actions/workflows/test.yml)
 
-*marctable* is a Python command line utility that converts MARC bibliographic data into tabular formats like [CSV] and [Parquet]. It uses the Library of Congress [MARC Bibliographic documentation] expressed as an [Avram] [JSON file] to determine what MARC fields and subfields to include and whether they can repeat or not.
+*marctable* is a Python command line utility that converts MARC bibliographic data (in transmission format or MARCXML) into tabular formats like [CSV] and [Parquet]. It uses the Library of Congress [MARC Bibliographic documentation] expressed as an [Avram] [JSON file] to determine what MARC fields and subfields to include and whether they can repeat or not.
 
 ## Install
 

diff --git a/marctable/utils.py b/marctable/utils.py
@@ -75,7 +75,7 @@ def dataframe_iter(
 
 
 def records_iter(
-    marc_input: BinaryIO, rules: list = [], batch: int = 1000
+        marc_input: BinaryIO, rules: list = [], batch: int = 1000
 ) -> Generator[List[Dict], None, None]:
     """
     Read MARC input and generate a list of dictionaries, where each list element
@@ -84,8 +84,14 @@ def records_iter(
     mapping = _mapping(rules)
     marc = MARC.from_avram()
 
+    # TODO: MARCXML parsing brings all the records into memory
+    if marc_input.name.endswith('.xml'):
+        reader = pymarc.marcxml.parse_xml_to_array(marc_input)
+    else:
+        reader = pymarc.MARCReader(marc_input)
+
     rows = []
-    for record in pymarc.MARCReader(marc_input):
+    for record in reader:
         # if pymarc can't make sense of a record it returns None
         if record is None:
             # TODO: log this?