Add support for converting from MARCXML
This adds support for converting MARCXML files. It uses pymarc's
`pymarc.marcxml.parse_xml_to_array` so it will read all the records in
the XML file into memory, and then convert them.

Fixes #9
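
Reviewer note: the memory concern in the message above can be illustrated with a streaming parse. The sketch below is not pymarc's API and not what this commit does. It uses only the standard library's `xml.etree.ElementTree.iterparse` to walk a MARCXML collection one record at a time. The helper name `iter_marcxml_fields`, the `{tag: [values]}` row shape, and the sample data are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Iterator, List

# Namespace used by MARCXML ("MARC21 slim") documents
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def iter_marcxml_fields(xml_input: BinaryIO) -> Iterator[Dict[str, List[str]]]:
    """Hypothetical helper: yield one {tag: [values]} dict per <record>."""
    for _event, elem in ET.iterparse(xml_input, events=("end",)):
        if elem.tag != MARC_NS + "record":
            continue
        row: Dict[str, List[str]] = {}
        for field in elem:
            tag = field.get("tag")
            if tag is None:
                continue  # the <leader> element carries no tag attribute
            if field.tag == MARC_NS + "controlfield":
                value = field.text or ""
            else:
                # join the subfield values of a datafield with spaces
                value = " ".join(sf.text or "" for sf in field)
            row.setdefault(tag, []).append(value)
        yield row
        # drop the record's children once the caller has consumed the row;
        # the emptied element itself stays attached to the root
        elem.clear()


# Made-up sample data, for illustration only
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">rec1</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">First title</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">rec2</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Second title :</subfield>
      <subfield code="b">a subtitle</subfield>
    </datafield>
  </record>
</collection>
"""

rows = list(iter_marcxml_fields(io.BytesIO(SAMPLE)))
```

Because the generator hands back one row per record, a caller could batch rows as they arrive rather than waiting for the whole file to parse. Whether that fits marctable's batching is left open here.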
edsu committed Jan 16, 2024
1 parent e10a0af commit 2296b10
Showing 4 changed files with 786,431 additions and 3 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

 [![Build Status](https://github.com/edsu/marctable/actions/workflows/test.yml/badge.svg)](https://github.com/edsu/marctable/actions/workflows/test.yml)

-*marctable* is a Python command line utility that converts MARC bibliographic data into tabular formats like [CSV] and [Parquet]. It uses the Library of Congress [MARC Bibliographic documentation] expressed as an [Avram] [JSON file] to determine what MARC fields and subfields to include and whether they can repeat or not.
+*marctable* is a Python command line utility that converts MARC bibliographic data (in transmission format or MARCXML) into tabular formats like [CSV] and [Parquet]. It uses the Library of Congress [MARC Bibliographic documentation] expressed as an [Avram] [JSON file] to determine what MARC fields and subfields to include and whether they can repeat or not.

 ## Install

10 changes: 8 additions & 2 deletions marctable/utils.py
@@ -75,7 +75,7 @@ def dataframe_iter(


 def records_iter(
-    marc_input: BinaryIO, rules: list = [], batch: int = 1000
+    marc_input: BinaryIO, rules: list = [], batch: int = 1000
 ) -> Generator[List[Dict], None, None]:
     """
     Read MARC input and generate a list of dictionaries, where each list element
@@ -84,8 +84,14 @@ def records_iter(
     mapping = _mapping(rules)
     marc = MARC.from_avram()

+    # TODO: MARCXML parsing brings all the records into memory
+    if marc_input.name.endswith('.xml'):
+        reader = pymarc.marcxml.parse_xml_to_array(marc_input)
+    else:
+        reader = pymarc.MARCReader(marc_input)
+
     rows = []
-    for record in pymarc.MARCReader(marc_input):
+    for record in reader:
         # if pymarc can't make sense of a record it returns None
         if record is None:
             # TODO: log this?
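
Reviewer note: the format check in this hunk reads `marc_input.name`, which in-memory streams such as `io.BytesIO` do not provide. The sketch below shows one defensive way to write the same extension test. The helper name `looks_like_marcxml` and the `NamedBytesIO` test double are hypothetical, and treating nameless streams as non-XML is an assumption, not behavior taken from this commit.

```python
import io
from typing import BinaryIO


def looks_like_marcxml(marc_input: BinaryIO) -> bool:
    """Hypothetical helper: True when the stream's name ends in .xml."""
    # getattr guards against streams with no name at all, such as io.BytesIO
    name = getattr(marc_input, "name", "")
    # file objects opened from a descriptor report an int name, so check the type
    return isinstance(name, str) and name.lower().endswith(".xml")


class NamedBytesIO(io.BytesIO):
    """Test double (assumption): an in-memory stream that carries a file name."""

    def __init__(self, data: bytes, name: str) -> None:
        super().__init__(data)
        self.name = name


xml_stream = NamedBytesIO(b"", "records.XML")
mrc_stream = NamedBytesIO(b"", "records.mrc")
anonymous_stream = io.BytesIO(b"")
```

With this shape, `looks_like_marcxml(anonymous_stream)` returns `False` instead of raising `AttributeError`, so a nameless stream would fall through to the binary MARC reader.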