This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Creating a DocumentFragment results in a slow memory leak #20

Closed
rgrove opened this issue Feb 22, 2015 · 12 comments

Comments

@rgrove
Contributor

rgrove commented Feb 22, 2015

Creating a DocumentFragment from a Nokogumbo document appears to result in a slow memory leak. The larger the HTML input, the larger the leak is. Here's a simple repro case (I tested against Nokogumbo 1.3.0):

#!/usr/bin/env ruby
# encoding: utf-8
require 'nokogumbo'

html = %[
  <p><b>Rome</b> (,  , ) is a city and special <i><a href="comune">comune</a></i> (named "Roma Capitale") in <a href="Italy">Italy</a>. Rome is the capital of Italy and <a href="Regions of Italy">region</a> of <a href="Lazio">Lazio</a>. With 2.9&nbsp;million residents in , it is also the country's largest and most populated <i>comune</i> and <a href="Largest cities of the European Union by population within city limits">fourth-most populous city</a> in the European Union by population within city limits. The <a href="Metropolitan City of Rome">Metropolitan City of Rome</a> has a population of 4.3&nbsp;million residents.<sup class="reference" id="cite_ref-PR_2-1">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-PR_2">2</a>]</sup> The city is located in the central-western portion of the <a href="Italian Peninsula">Italian Peninsula</a>, within Lazio (Latium), along the shores of <a href="Tiber">Tiber</a> river. <a href="Vatican City">Vatican City</a> is an independent country within the city boundaries of Rome, the only existing example of a country within a city: for this reason Rome has been often defined as capital of two states.<sup class="reference" id="cite_ref-3-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-3">3</a>]</sup><sup class="reference" id="cite_ref-4-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-4">4</a>]</sup></p>
  <p><a href="History of Rome">Rome's history</a> spans <a href="List of cities by time of continuous habitation">more than two and a half thousand years</a>. Although Roman tradition states the founding of Rome around 753 BC, the site has been inhabited much earlier, being one of the oldest continuously occupied cities in Europe.<sup class="reference" id="cite_ref-5-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-5">5</a>]</sup> The city's early population originated from a mix of <a href="Latins">Latins</a>, <a href="Etruscan civilization">Etruscans</a> and <a href="Sabines">Sabines</a>. Eventually, the city  successively became the capital of the <a href="Roman Kingdom">Roman Kingdom</a>, the <a href="Roman Republic">Roman Republic</a> and the <a href="Roman Empire">Roman Empire</a>, and is regarded as one of the birthplaces of <a href="Western culture">Western civilization</a>. It is referred to as "Roma Aeterna" (The Eternal City) <sup class="reference" id="cite_ref-6-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-6">6</a>]</sup> and "<a href="Caput Mundi">Caput Mundi</a>" (Capital of the World), two central notions in ancient Roman culture.</p>
  <p>After the <a href="Fall of the Roman Empire">Fall of the Empire</a>, which marked the begin of the <a href="Middle Ages">Middle Ages</a>, Rome slowly fell under the political control of the <a href="Pope">Pope</a>, which had settled in the city since the 1st century AD, until in the 8th century it became the capital of the <a href="Papal States">Papal States</a>, which lasted until 1870.</p>
  <p>Beginning with the <a href="Renaissance">Renaissance</a>, almost all the popes since <a href="Pope Nicholas V">Nicholas V</a> (1422–55) pursued coherently along four hundred years an architectonic and urbanistic program aimed to make of the city the world's artistic and cultural center.<sup class="reference" id="cite_ref-7-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-7">7</a>]</sup> Due to that, Rome became first one of the major centers of the <a href="Italian Renaissance">Italian Renaissance</a>,<sup class="reference" id="cite_ref-8-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-8">8</a>]</sup> and then the birthplace of the <a href="Baroque">Baroque</a> style. Famous artists and architects of the Renaissance and Baroque period made Rome the center of their activity, creating masterpieces throughout the city. In 1871 Rome became the capital of the <a href="Kingdom of Italy (1861–1946)">Kingdom of Italy</a>, and in 1946 that of the <a href="Italian Republic">Italian Republic</a>.</p>
  <p>Rome has the status of a <a href="global city">global city</a>.<sup class="reference" id="cite_ref-lboro.ac.uk_9-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-lboro.ac.uk_9">9</a>]</sup><sup class="reference" id="cite_ref-10-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-10">10</a>]</sup><sup class="reference" id="cite_ref-atkearney.at_11-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-atkearney.at_11">11</a>]</sup> In 2011, Rome was the 18th-most-visited city in the world, 3rd most visited in the <a href="European Union">European Union</a>, and the most popular tourist attraction in Italy.<sup class="reference" id="cite_ref-Caroline Bremner_12-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-Caroline Bremner_12">12</a>]</sup> Its historic centre is listed by <a href="UNESCO">UNESCO</a> as a <a href="World Heritage Site">World Heritage Site</a>.<sup class="reference" id="cite_ref-whc.unesco.org_13-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-whc.unesco.org_13">13</a>]</sup> Monuments and museums such as the <a href="Vatican Museums">Vatican Museums</a> and the <a href="Colosseum">Colosseum</a> are among the world's most visited tourist destinations with both locations receiving millions of tourists a year. Rome hosted the <a href="1960 Summer Olympics">1960 Summer Olympics</a> and is the seat of United Nations' <a href="Food and Agriculture Organization">Food and Agriculture Organization</a> (FAO).</p>
  <p><table id="toc" class="toc" summary="Contents"><tr><td><div id="toctitle"><h2>Table of Contents</h2></div><ul><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Etymology" rel="nofollow" title="#Etymology">#Etymology</a>">Etymology</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23History" rel="nofollow" title="#History">#History</a>">History</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Earliest_history" rel="nofollow" title="#Earliest_history">#Earliest_history</a>">Earliest history</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Legend_of_the_founding_of_Rome" rel="nofollow" title="#Legend_of_the_founding_of_Rome">#Legend_of_the_founding_of_Rome</a>">Legend of the founding of Rome</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Monarchy_republic_empire" rel="nofollow" title="#Monarchy_republic_empire">#Monarchy_republic_empire</a>">Monarchy, republic, empire</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Middle_Ages" rel="nofollow" title="#Middle_Ages">#Middle_Ages</a>">Middle Ages</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Early_modern" rel="nofollow" title="#Early_modern">#Early_modern</a>">Early modern</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Late_modern_and_contemporary" rel="nofollow" title="#Late_modern_and_contemporary">#Late_modern_and_contemporary</a>">Late modern and contemporary</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Government" rel="nofollow" title="#Government">#Government</a>">Government</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Local_government" rel="nofollow" 
title="#Local_government">#Local_government</a>">Local government</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Administrative_and_historical_subdivisions" rel="nofollow" title="#Administrative_and_historical_subdivisions">#Administrative_and_historical_subdivisions</a>">Administrative and historical subdivisions</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Metropolitan_and_regional_government" rel="nofollow" title="#Metropolitan_and_regional_government">#Metropolitan_and_regional_government</a>">Metropolitan and regional government</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23National_government" rel="nofollow" title="#National_government">#National_government</a>">National government</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Geography" rel="nofollow" title="#Geography">#Geography</a>">Geography</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Location" rel="nofollow" title="#Location">#Location</a>">Location</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Topography" rel="nofollow" title="#Topography">#Topography</a>">Topography</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Climate" rel="nofollow" title="#Climate">#Climate</a>">Climate</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Demographics" rel="nofollow" title="#Demographics">#Demographics</a>">Demographics</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Ethnic_groups" rel="nofollow" title="#Ethnic_groups">#Ethnic_groups</a>">Ethnic groups</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Religion" rel="nofollow" 
title="#Religion">#Religion</a>">Religion</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Vatican_City" rel="nofollow" title="#Vatican_City">#Vatican_City</a>">Vatican City</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Pilgrimage" rel="nofollow" title="#Pilgrimage">#Pilgrimage</a>">Pilgrimage</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cityscape" rel="nofollow" title="#Cityscape">#Cityscape</a>">Cityscape</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Architecture" rel="nofollow" title="#Architecture">#Architecture</a>">Architecture</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Ancient_Rome" rel="nofollow" title="#Ancient_Rome">#Ancient_Rome</a>">Ancient Rome</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Medieval" rel="nofollow" title="#Medieval">#Medieval</a>">Medieval</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Renaissance_and_Baroque" rel="nofollow" title="#Renaissance_and_Baroque">#Renaissance_and_Baroque</a>">Renaissance and Baroque</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Neoclassicism" rel="nofollow" title="#Neoclassicism">#Neoclassicism</a>">Neoclassicism</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Fascist_architecture" rel="nofollow" title="#Fascist_architecture">#Fascist_architecture</a>">Fascist architecture</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Parks_and_gardens" rel="nofollow" title="#Parks_and_gardens">#Parks_and_gardens</a>">Parks and gardens</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Fountains_and_aqueducts" rel="nofollow" 
title="#Fountains_and_aqueducts">#Fountains_and_aqueducts</a>">Fountains and aqueducts</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Statues" rel="nofollow" title="#Statues">#Statues</a>">Statues</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Obelisks_and_columns" rel="nofollow" title="#Obelisks_and_columns">#Obelisks_and_columns</a>">Obelisks and columns</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Bridges" rel="nofollow" title="#Bridges">#Bridges</a>">Bridges</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Catacombs" rel="nofollow" title="#Catacombs">#Catacombs</a>">Catacombs</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Economy" rel="nofollow" title="#Economy">#Economy</a>">Economy</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Education" rel="nofollow" title="#Education">#Education</a>">Education</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Culture" rel="nofollow" title="#Culture">#Culture</a>">Culture</a><ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Entertainment_and_performing_arts" rel="nofollow" title="#Entertainment_and_performing_arts">#Entertainment_and_performing_arts</a>">Entertainment and performing arts</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Tourism" rel="nofollow" title="#Tourism">#Tourism</a>">Tourism</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cuisine" rel="nofollow" title="#Cuisine">#Cuisine</a>">Cuisine</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Cinema" rel="nofollow" title="#Cinema">#Cinema</a>">Cinema</a></li><li><a href="<a class="tweet-url 
hashtag" href="https://twitter.com/#!/search?q=%23Language" rel="nofollow" title="#Language">#Language</a>">Language</a></li></ul><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Sports" rel="nofollow" title="#Sports">#Sports</a>">Sports</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Transport" rel="nofollow" title="#Transport">#Transport</a>">Transport</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23International_entities_organisations_and_involvement" rel="nofollow" title="#International_entities_organisations_and_involvement">#International_entities_organisations_and_involvement</a>">International entities, organisations and involvement</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Twin_towns_sister_cities_and_partner_cities" rel="nofollow" title="#Twin_towns_sister_cities_and_partner_cities">#Twin_towns_sister_cities_and_partner_cities</a>">Twin towns, sister cities and partner cities</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23See_also" rel="nofollow" title="#See_also">#See_also</a>">See also</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23References" rel="nofollow" title="#References">#References</a>">References</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Bibliography" rel="nofollow" title="#Bibliography">#Bibliography</a>">Bibliography</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23Documentaries" rel="nofollow" title="#Documentaries">#Documentaries</a>">Documentaries</a></li><li><a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23External_links" rel="nofollow" title="#External_links">#External_links</a>">External links</a></li></ul></ul></td></tr></table></p>
  <h2><span class="editsection">&#91;<a href="?section=Etymology" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Etymology"></a><span class="mw-headline" id="Etymology">Etymology</span></h2>
  <p>About the origin of the name <i>Roma</i> several hypotheses have been advanced.<sup class="reference" id="cite_ref-14-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-14">14</a>]</sup> The most important are the following:
  <ul><li>From <i>Rumon</i> or <i>Rumen</i>, archaic name of the <a href="Tiber">Tiber</a>, which in turn has the same root as the Greek verb ῥέω (rhèo) and the Latin verb <i>ruo</i>, which both mean "flow";<sup class="reference" id="cite_ref-15-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-15">15</a>]</sup></li><li>From the <a href="Etruscan language">Etruscan</a> word <i>ruma</i>, whose root is *rum- "teat", with possible reference either to the <a href="Founding of Rome#The legend">totem wolf that adopted and suckled</a> the cognately named twins <a href="Romulus and Remus">Romulus and Remus</a>, or to the shape of the <a href="Palatine Hill">Palatine</a> and <a href="Aventine Hill">Aventine Hills</a>;</li><li>From the Greek word ῤώμη (rh�mē), which means <i>strength</i>.<sup class="reference" id="cite_ref-16-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-16">16</a>]</sup></li></ul></p>
  <h2><span class="editsection">&#91;<a href="?section=History" title="Edit section: Etymology">edit</a>&#93;</span> <a name="History"></a><span class="mw-headline" id="History">History</span></h2>
  <h3><span class="editsection">&#91;<a href="?section=Earliest_history" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Earliest_history"></a><span class="mw-headline" id="Earliest_history">Earliest history</span></h3>
  <p>There is archaeological evidence of human occupation of the Rome area from approximately 14,000 years ago, but the dense layer of much younger debris obscures Palaeolithic and Neolithic sites.<sup class="reference" id="cite_ref-17-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-17">17</a>]</sup> Evidence of stone tools, pottery and stone weapons attest to about 10,000 years of human presence. Several excavations support the view that Rome grew from <a href="pastoralism">pastoral</a> settlements on the <a href="Palatine Hill">Palatine Hill</a> built above the area of the future <a href="Roman Forum">Roman Forum</a>. While some archaeologists argue that Rome was indeed founded in the middle of the 8th century BC (the date of the tradition), the date is subject to controversy.<sup class="reference" id="cite_ref-foundation_18-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-foundation_18">18</a>]</sup> However, the power of the well known tale of Rome's legendary foundation tends to deflect attention from its actual, and much more ancient, origins.</p>
  <h4><span class="editsection">&#91;<a href="?section=Legend_of_the_founding_of_Rome" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Legend_of_the_founding_of_Rome"></a><span class="mw-headline" id="Legend_of_the_founding_of_Rome">Legend of the founding of Rome</span></h4>
  <p><div class="thumb tright"><div class="thumbinner" style="width: 180px;"><a href="File:She-wolf suckles Romulus and Remus.jpg" class="image" title="Capitoline Wolf suckles the infant twins Romulus and Remus."><img src="She-wolf suckles Romulus and Remus.jpg" alt="Capitoline Wolf suckles the infant twins Romulus and Remus." title="Capitoline Wolf suckles the infant twins Romulus and Remus." style="float:right" /></a><div class="thumbcaption"><a href="Capitoline Wolf">Capitoline Wolf</a> suckles the infant twins <a href="Romulus and Remus">Romulus and Remus</a>.</div></div></div> Traditional stories handed down by the <a href="ancient Romans">ancient Romans</a> themselves explain the earliest <a href="History of Rome">history of their city</a> in terms of <a href="legend">legend</a> and <a href="myth">myth</a>. The most familiar of these myths, and perhaps the most famous of all <a href="Roman mythology">Roman myths</a>, is the story of <a href="Romulus and Remus">Romulus and Remus</a>, the twins who were suckled by a <a href="wolf">she-wolf</a>.<sup class="reference" id="cite_ref-livy1797_19-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-livy1797_19">19</a>]</sup> They decided to build a city, but after an argument, <a href="Romulus">Romulus</a> killed his brother. 
According to the Roman <a href="annalist">annalists</a>, this happened on 21 April 753 BC.<sup class="reference" id="cite_ref-awg73_20-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup> This legend had to be reconciled with a dual tradition, set earlier in time, that had the <a href="Trojan War">Trojan refugee</a> <a href="Aeneas">Aeneas</a> escape to Italy and found the line of Romans through his son <a href="Ascanius">Iulus</a>, the namesake of the <a href="Julio-Claudian dynasty">Julio-Claudian dynasty</a>.<sup class="reference" id="cite_ref-livy2005_21-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-livy2005_21">21</a>]</sup> This was accomplished by the Roman poet <a href="Virgil">Virgil</a> in the first century BC.</p>
  <h3><span class="editsection">&#91;<a href="?section=Monarchy_republic_empire" title="Edit section: Etymology">edit</a>&#93;</span> <a name="Monarchy_republic_empire"></a><span class="mw-headline" id="Monarchy_republic_empire">Monarchy, republic, empire</span></h3>
  <p>After the legendary foundation by <a href="Romulus">Romulus</a>,<sup class="reference" id="cite_ref-awg73_20-1">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup> Rome was ruled for a period of 244 years by a monarchical system, initially with sovereigns of <a href="Latins (Italic tribe)">Latin</a> and <a href="Sabines">Sabine</a> origin, later by <a href="Etruscans">Etruscan</a> kings. The tradition handed down seven kings: Romulus, <a href="Numa Pompilius">Numa Pompilius</a>, <a href="Tullus Hostilius">Tullus Hostilius</a>, <a href="Ancus Marcius">Ancus Marcius</a>, <a href="Tarquinius Priscus">Tarquinius Priscus</a>, <a href="Servius Tullius">Servius Tullius</a> and <a href="Tarquin the Proud">Tarquin the Proud</a>.<sup class="reference" id="cite_ref-awg73_20-2">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg73_20">20</a>]</sup></p>
  <p>In 509 BC the Romans expelled from the city the last king and established an oligarchic republic: since then, for Rome began a period characterized by internal struggles between <a href="Patrician (ancient Rome)">patricians</a> (aristocrats) and <a href="Plebs">plebeians</a> (small landowners), and by constant warfare against the populations of central Italy: Etruscans, Latins, <a href="Volsci">Volsci</a>, <a href="Aequi">Aequi</a>.<sup class="reference" id="cite_ref-awg77_22-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg77_22">22</a>]</sup> After becoming master of <a href="Latium">Latium</a>, Rome led several wars (against the <a href="Gauls">Gauls</a>, <a href="Osci">Osci</a>-<a href="Samnites">Samnites</a> and the Greek colony of <a href="Taranto">Taranto</a>, allied with <a href="Pyrrhus of Epirus">Pyrrhus</a>, king of <a href="Epirus">Epirus</a>) whose result was the conquest of the <a href="Italian peninsula">Italian peninsula</a>, from the central area up to <a href="Magna Graecia">Magna Graecia</a>.<sup class="reference" id="cite_ref-awg79_23-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg79_23">23</a>]</sup></p>
  <p>The third and second century BC saw the establishment of the Roman hegemony over the Mediterranean and the East, through the three <a href="Punic Wars">Punic Wars</a> (264-146 BC) fought against the city of <a href="Carthage">Carthage</a> and the three <a href="Macedonian Wars">Macedonian Wars</a> (212-168 BC) against <a href="Macedonia (ancient kingdom)">Macedonia</a>.<sup class="reference" id="cite_ref-awg8183_24-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg8183_24">24</a>]</sup> Then were established the first <a href="Roman province">Roman provinces</a>: <a href="Sicilia (Roman province)">Sicily</a>, <a href="Corsica et Sardinia">Sardinia and Corsica</a>, <a href="Hispania">Spain</a>, <a href="Macedonia (Roman province)">Macedonia</a>, <a href="Achaea (Roman province)">Greece (Achaia)</a>, <a href="Africa (Roman province)">Africa</a>.<sup class="reference" id="cite_ref-awg8185_25-0">[<a href="<a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23cite_note" rel="nofollow" title="#cite_note">#cite_note</a>-awg8185_25">25</a>]</sup></p>
]

200_000.times do |i|
  frag = Nokogiri::HTML5.fragment(html)

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end

After 100,000 iterations, this process consumes 87MB on my system.

The leak isn't specific to the Nokogiri::HTML5.fragment method. Creating a fragment manually also leaks:

200_000.times do |i|
  doc = Nokogiri::HTML5.parse("<html><body>#{html}")
  frag = doc.fragment
  frag << doc.xpath('/html/body/node()')

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end

Vanilla Nokogiri doesn't leak:

200_000.times do |i|
  frag = Nokogiri::HTML.fragment(html)

  if i % 1000 == 0
    GC.start
    puts "#{i} Memory: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts
  end
end
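
The three loops above differ only in how the fragment is built, so the measurement part can be factored into a small harness. This is a minimal sketch, assuming a Unix-like system; the names rss_mb and watch_memory are hypothetical and not part of Nokogiri or Nokogumbo.

```ruby
# Hypothetical helpers for illustration; not part of Nokogiri or Nokogumbo.

# Resident set size of the current process, in MB.
def rss_mb
  status = '/proc/self/status'
  if File.readable?(status)
    # Linux: the VmRSS line is reported in kB.
    line = File.foreach(status).find { |l| l.start_with?('VmRSS:') }
    return line.split[1].to_i / 1024 if line
  end
  # macOS and other Unix systems: ps reports RSS in kilobytes.
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

# Runs the block `iterations` times. Every `every` iterations it forces a GC
# run, samples RSS, and prints it. Returns the samples as [iteration, mb] pairs.
def watch_memory(iterations:, every: 1000, out: $stdout)
  samples = []
  iterations.times do |i|
    yield i
    next unless (i % every).zero?

    GC.start
    samples << [i, rss_mb]
    out.puts "#{i} Memory: #{samples.last[1]}MB"
  end
  samples
end

# Usage with the repro above (requires the nokogumbo gem and the html string):
#   watch_memory(iterations: 200_000) { Nokogiri::HTML5.fragment(html) }
#   watch_memory(iterations: 200_000) { Nokogiri::HTML.fragment(html) }
```

Comparing the first and last samples of each run gives a rough growth figure without reading the whole log.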

I tried poking around a bit to see if I could spot an obvious cause, but C isn't my forte.

@fabn

fabn commented Nov 18, 2015

@rubys any chance to get this fixed?

I'm still experiencing this, and since I run a lot of background jobs that process HTML, this issue eats up all available memory. The only way to recover is to restart Sidekiq.

@rubys
Owner

rubys commented Nov 20, 2015

I don't know how to proceed with this. Let me describe what is happening, in parse (which people should be able to follow, even if they don't know C):

static VALUE parse(VALUE self, VALUE string) {

  1. First the gumbo parser is called, which creates a data structure
  2. Then a new (empty) libxml2 document is created
  3. Then I call walk tree which copies data from the gumbo data structure to the libxml2 data structure
  4. The root element is then set, as well as the doctype
  5. The gumbo data structure is destroyed
  6. The libxml2 data structure is "wrapped" by a Nokogiri object.

Possibilities include: the gumbo parser is what is leaking (i.e., the call to gumbo_destroy_output doesn't clean up everything), or wrapping the libxml2 data structure using the Nokogiri call somehow differs from what Nokogiri itself does.

@rubys
Owner

rubys commented Nov 20, 2015

One additional comment: there is only one call to malloc in my code: that deals with a case involving xml namespaces and that string is freed immediately after the loop in which it is allocated is exited. My conclusion is that the leak is either in memory allocated by Nokogiri/libxml2, or in memory allocated by Gumbo.

@Aqualon

Aqualon commented Aug 13, 2018

We also hit this issue after upgrading to Sanitize 4.6.5 (from 2.x) for a security fix. Our memory usage started to increase quite dramatically whenever we went a long time without deploying:

[Image: memory graph]

We have now switched over to the rails-html-sanitizer gem instead.

@stevecheckoway
Collaborator

I spent a few hours looking into this and I'm afraid I don't know where the memory leak is.

Before starting my investigation, I could think of a handful of possible causes.

  1. Nokogiri is leaking
  2. Ruby objects are leaking
  3. Gumbo is leaking
  4. Nokogumbo is misusing the libxml2 or gumbo APIs
  5. Nokogumbo itself is leaking

First, I am able to reproduce the slow memory leak using @rgrove's code (thanks for that!). It starts off taking about 55 MB of RAM on my machine and slowly increases. By contrast, Nokogiri's HTML.fragment method takes about 35 MB and stays constant. I don't know what accounts for the extra 20 MB. Since vanilla Nokogiri's memory use stays flat, this suggests Nokogiri itself isn't at fault here.

Second, I used Ruby's GC::Profiler on a slightly modified version of @rgrove's code which didn't invoke the garbage collector manually and didn't run ps. The result was a constant total (managed) heap size and a constant object count. (The heap use size (:HEAP_USE_SIZE) seemed to cycle between a small handful of fixed values. I didn't investigate that, but I assume it's due to when the garbage collector runs.) This suggests that we're not leaking Ruby objects.
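
A rough sketch of that kind of check follows. It is not the script used above: the string-building block is a stand-in workload so the example runs without Nokogumbo, and profile_gc is a hypothetical helper name. The only library calls are GC::Profiler.enable, clear, raw_data and disable.

```ruby
# Hypothetical helper: collects GC::Profiler records while the block runs.
def profile_gc
  GC::Profiler.clear
  GC::Profiler.enable
  yield
  # raw_data has to be read while the profiler is still enabled.
  GC::Profiler.raw_data || []
ensure
  GC::Profiler.disable
end

# Stand-in workload (an assumption); the original measured
# Nokogiri::HTML5.fragment(html) without calling GC.start or ps.
records = profile_gc do
  20_000.times { Array.new(100) { |n| "node #{n} " * 10 } }
end

total_sizes   = records.map { |r| r[:HEAP_TOTAL_SIZE] }
object_counts = records.map { |r| r[:HEAP_TOTAL_OBJECTS] }

puts "GC runs recorded: #{records.size}"
puts "HEAP_TOTAL_SIZE min/max: #{total_sizes.minmax.inspect}"
puts "HEAP_TOTAL_OBJECTS min/max: #{object_counts.minmax.inspect}"
```

If the minimum and maximum stay close together over a long run, the Ruby-managed heap is not growing, which points the search at native allocations instead.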

Third, Gumbo's GumboOptions struct contains members for an allocator and deallocator. By default, these are wrappers for malloc and free. I wrote new wrappers that print the address of each allocation and deallocation to stderr. I wrote a simple Python program to analyze the output. I ran Nokogumbo's tests as well as 1000 iterations of @rgrove's loop. Each allocation is properly deallocated. In the latter case, that worked out to 9024000 allocations and 9024000 corresponding deallocations. This suggests that Gumbo is not leaking and that Nokogumbo isn't misusing the Gumbo API.
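
The pairing check can be sketched as below. The analysis above was done in Python; this version is in Ruby to match the rest of the thread. The log format ("alloc ADDRESS" and "free ADDRESS", one per line) and the name analyze_alloc_log are assumptions for illustration, not the actual wrappers or script.

```ruby
# Hypothetical analyzer for an allocation log with lines such as
#   alloc 0x7fb5c7c468e0
#   free 0x7fb5c7c468e0
# Lines are processed in order because an address can be reused after a free.
def analyze_alloc_log(lines)
  live = Hash.new(0)   # address => number of outstanding allocations
  allocs = 0
  frees = 0
  unknown_frees = []   # frees of addresses that were not live

  lines.each do |line|
    op, addr = line.split
    case op
    when 'alloc'
      allocs += 1
      live[addr] += 1
    when 'free'
      frees += 1
      if live[addr].positive?
        live[addr] -= 1
        live.delete(addr) if live[addr].zero?
      else
        unknown_frees << addr
      end
    end
  end

  { allocs: allocs, frees: frees, leaked: live.keys, unknown_frees: unknown_frees }
end

log = <<~LOG
  alloc 0x1000
  alloc 0x2000
  free 0x1000
  alloc 0x1000
  free 0x1000
LOG

result = analyze_alloc_log(log.each_line)
p result
# With a real capture, e.g. after `ruby repro.rb 2> alloc.log`:
#   analyze_alloc_log(File.foreach('alloc.log'))
```

In the sample log, 0x2000 is never freed, so it is the only address reported under :leaked. A clean run like the one described above would report equal alloc and free counts and an empty :leaked list.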

Fourth, it's unlikely that Nokogumbo itself is leaking (apart from misusing the libxml2 API). As @rubys points out, there's a single call to malloc and it is properly paired with a call to free.

So where does this leave us? It's possible there's a misuse of the libxml2 API. I don't know what the object ownership model of libxml2 is. I assume that the document owns all of its nodes but the individual function documentation isn't great.

Nokogumbo is pretty simple and its use of the libxml2 API is also pretty simple: it creates a document, it creates an "internal subset" (which I think is just a DTD), and it creates some nodes and properties. The functions it uses are the following.

  • xmlNewDoc
  • xmlCreateIntSubset
  • xmlDocSetRootElement
  • xmlNewDocNode
  • xmlNewDocText
  • xmlNewCDataBlock
  • xmlNewDocComment
  • xmlAddChild
  • xmlNewProp

I checked that xmlDocSetRootElement isn't returning a root element that could be leaking. xmlCreateIntSubset stores the created DTD inside the document (see https://github.com/GNOME/libxml2/blob/master/tree.c#L1006).

The results of xmlNewDocNode, xmlNewDocText, xmlNewCDataBlock, and xmlNewDocComment are added to their parent elements via xmlAddChild. And finally, xmlNewProp attaches the newly created attribute to the element (see https://github.com/GNOME/libxml2/blob/master/tree.c#L1905).

Oh. Having done all of this and written it all up, I've found where the leaks occur. Using the leaks tool on macOS, I see two leaks, with these stack traces.

Leak: 0x7fb5c7c468e0  size=16  zone: DefaultMallocZone_0x106865000
        Call stack: [thread 0x7fff98b0b380]: | 0x7fff60324015 (libdyld.dylib) start | 0x10675af3b (ruby) main | 0x106a8471d (libruby.2.4.dylib) ruby_run_node | 0x106a847ec (libruby.2.4.dylib) ruby_exec_internal | 0x106b79df4 (libruby.2.4.dylib) vm_exec | 0x106b6e158 (libruby.2.4.dylib) vm_exec_core | 0x106b7d729 (libruby.2.4.dylib) vm_call_cfunc | 0x106ac9dbb (libruby.2.4.dylib) int_dotimes | 0x106b76937 (libruby.2.4.dylib) rb_yield_1 | 0x106b830a2 (libruby.2.4.dylib) invoke_block_from_c_splattable | 0x106b79df4 (libruby.2.4.dylib) vm_exec | 0x106b6e781 (libruby.2.4.dylib) vm_exec_core | 0x106b7d729 (libruby.2.4.dylib) vm_call_cfunc | 0x106fd42e3 (nokogumboc.bundle) parse | 0x106b763d6 (libruby.2.4.dylib) rb_funcall | 0x106b82ec6 (libruby.2.4.dylib) rb_call0 | 0x106b82883 (libruby.2.4.dylib) vm_call0_body | 0x106c72e2a (nokogiri.bundle) new | 0x106d02e99 (nokogiri.bundle) htmlNewDoc | 0x106d02c5d (nokogiri.bundle) htmlNewDocNoDtD | 0x106ccf8c7 (nokogiri.bundle) xmlCreateIntSubset | 0x106d64e6a (nokogiri.bundle) xmlStrdup | 0x106d64d8d (nokogiri.bundle) xmlStrndup | 0x106a99de2 (libruby.2.4.dylib) objspace_xmalloc0 | 0x7fff604cc4c7 (libsystem_malloc.dylib) malloc | 0x7fff604cd1e1 (libsystem_malloc.dylib) malloc_zone_malloc

Leak: 0x7fb5c7c48310  size=48  zone: DefaultMallocZone_0x106865000
        Call stack: [thread 0x7fff98b0b380]: | 0x7fff60324015 (libdyld.dylib) start | 0x10675af3b (ruby) main | 0x106a8471d (libruby.2.4.dylib) ruby_run_node | 0x106a847ec (libruby.2.4.dylib) ruby_exec_internal | 0x106b79df4 (libruby.2.4.dylib) vm_exec | 0x106b6e158 (libruby.2.4.dylib) vm_exec_core | 0x106b7d729 (libruby.2.4.dylib) vm_call_cfunc | 0x106ac9dbb (libruby.2.4.dylib) int_dotimes | 0x106b76937 (libruby.2.4.dylib) rb_yield_1 | 0x106b830a2 (libruby.2.4.dylib) invoke_block_from_c_splattable | 0x106b79df4 (libruby.2.4.dylib) vm_exec | 0x106b6e781 (libruby.2.4.dylib) vm_exec_core | 0x106b7d729 (libruby.2.4.dylib) vm_call_cfunc | 0x106fd42e3 (nokogumboc.bundle) parse | 0x106b763d6 (libruby.2.4.dylib) rb_funcall | 0x106b82ec6 (libruby.2.4.dylib) rb_call0 | 0x106b82883 (libruby.2.4.dylib) vm_call0_body | 0x106c72e2a (nokogiri.bundle) new | 0x106d02e99 (nokogiri.bundle) htmlNewDoc | 0x106d02c5d (nokogiri.bundle) htmlNewDocNoDtD | 0x106ccf9aa (nokogiri.bundle) xmlCreateIntSubset | 0x106d64e6a (nokogiri.bundle) xmlStrdup | 0x106d64d8d (nokogiri.bundle) xmlStrndup | 0x106a99de2 (libruby.2.4.dylib) objspace_xmalloc0 | 0x7fff604cc4c7 (libsystem_malloc.dylib) malloc | 0x7fff604cd1e1 (libsystem_malloc.dylib) malloc_zone_malloc
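Traces this long are easier to read with the Ruby VM frames filtered out. The sketch below assumes the format seen in the two traces above (frames separated by "|", each of the form "ADDRESS (image) symbol"); parse_call_stack and Frame are hypothetical names, and the sample line is an abbreviated copy of the first trace.

```ruby
# One stack frame: address, binary image, and symbol name.
Frame = Struct.new(:address, :image, :symbol)

# Split one "Call stack:" line from `leaks` output into frames.
# The format is inferred from the traces in this thread, not from
# any documentation of the tool.
def parse_call_stack(line)
  line.split('|').drop(1).map do |part|
    match = part.strip.match(/\A(0x\h+) \((.+?)\) (.+)\z/)
    match && Frame.new(match[1], match[2], match[3])
  end.compact
end

sample = 'Call stack: [thread 0x7fff98b0b380]: ' \
         '| 0x106b7d729 (libruby.2.4.dylib) vm_call_cfunc ' \
         '| 0x106fd42e3 (nokogumboc.bundle) parse ' \
         '| 0x106c72e2a (nokogiri.bundle) new ' \
         '| 0x106ccf8c7 (nokogiri.bundle) xmlCreateIntSubset ' \
         '| 0x106d64e6a (nokogiri.bundle) xmlStrdup'

frames = parse_call_stack(sample)
# Drop frames that belong to the Ruby VM itself.
interesting = frames.reject { |f| f.image.start_with?('libruby') }
puts interesting.map { |f| "#{f.image}: #{f.symbol}" }
```

With the VM frames removed, the chain from parse in nokogumboc.bundle through new and xmlCreateIntSubset down to xmlStrdup in nokogiri.bundle stands out.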

I have no idea why parse should be calling new in Nokogiri.

Well now I have even less idea what's going on. I ran bundle exec rake clean followed by bundle exec rake and now it's using significantly more memory, leaking faster, and leaks reports a leak in a totally different function, not in parse at all.

@stevecheckoway
Collaborator

Oh. If Nokogumbo was compiled without libxml2 headers, then parse calls this version of xmlNewDoc:

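// Fallback used when building without the libxml2 headers.
// new, internal_subset, and remove_ are presumably ID values defined elsewhere.
static VALUE xmlNewDoc(char* version) {
  // Create the document through Nokogiri's Ruby API.
  VALUE doc = rb_funcall(Document, new, 0);
  // Remove the internal subset (DTD) that the new document comes with.
  rb_funcall(rb_funcall(doc, internal_subset, 0), remove_, 0);
  return doc;
}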

I bet that's part of what is causing the leak! I have no idea why it would have been built that way. I don't know why it's using more memory now that I'm not building it that way. I'll have to investigate later. I don't have any more time to spend right now.
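In Ruby terms, that fallback amounts to roughly the following sketch. The name new_doc_without_dtd is hypothetical, and treating Document as Nokogiri::HTML::Document is an assumption; the C snippet only shows that some Document class receives new.

```ruby
# Rough Ruby equivalent of the C fallback above: Document.new followed
# by doc.internal_subset.remove. `new_doc_without_dtd` is a
# hypothetical name used only for this illustration.
def new_doc_without_dtd(document_class)
  doc = document_class.new
  doc.internal_subset.remove # unlink the DTD that `new` created
  doc
end

begin
  require 'nokogiri'
  # Assumption: Document in the C code is Nokogiri::HTML::Document.
  doc = new_doc_without_dtd(Nokogiri::HTML::Document)
  puts doc.internal_subset.inspect
rescue LoadError
  warn 'nokogiri is not installed; skipping the demonstration'
end
```

This lines up with the stack traces above, where the leaked strings are allocated by xmlCreateIntSubset underneath new, that is, while the DTD that this code immediately removes is being created.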

@stevecheckoway
Collaborator

At least one leak is a bug in Nokogiri. I'm not sure if there's a workaround or not. And I appear to have found a second leak, but I haven't tracked down its source and I really do need to stop working on this now.

@flavorjones
Collaborator

If you think there's a memory leak in Nokogiri, kindly report it with as much information as you have and we'll look into it.

@stevecheckoway
Collaborator

@flavorjones I did here.

@flavorjones
Collaborator

Ah, I see you've reported it at sparklemotion/nokogiri#1784, will take a look as soon as I can (on vacation at the moment)

@stevecheckoway
Collaborator

I'm pretty sure that I've worked out where the bug in Nokogiri is. I left a detailed comment here. That said, it should only be an issue if you're compiling Nokogumbo without the libxml2 headers. The 2.0.0 alpha version that's currently on Rubygems tries pretty hard to not do that unless you explicitly request it.

I'll try to verify that this is the case soon, but if anyone affected by this issue would care to try that gem out and give feedback, that'd be fantastic.

@stevecheckoway
Collaborator

I believe @flavorjones fixed this several years ago. I'm going to close the issue. Feel free to re-open if this hasn't fixed it and I'll investigate further. Thanks again for the helpful test case!
