hacker-news-gpt-2

Dump of generated texts from gpt-2-simple trained on Hacker News titles posted up to April 25th, 2019 (about 603k titles, 30MB of text) for 36,813 steps (12 hours w/ a P100 GPU, costing ~$6). The output bears little resemblance to that of Markov chains.

For each temperature, there are 20 dumps of 1,000 titles (you can see some good curated titles in the good_XXX.txt files). The higher the temperature, the crazier the text.

  • temp_0_7: Normal and syntactically correct, but the AI sometimes copies existing titles verbatim. I recommend checking against HN Search.
  • temp_1_0: Crazier, mostly syntactically correct. Funnier IMO. Almost all titles are unique and have not been posted on HN before.
  • temp_1_3: Even more crazy, occasionally syntactically correct.

The top_p variants are generated at the same temperatures using nucleus sampling at 0.9. The results at each corresponding temperature are slightly crazier, but do not go off the rails.

How To Get the Text and Train the Model

The Hacker News titles were retrieved from BigQuery (w/ a trick to decode HTML entities that occasionally clutter BQ data):

CREATE TEMPORARY FUNCTION HTML_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
// Decode numeric HTML entities (e.g. &#39;) back into characters.
var decodeHtmlEntity = function(str) {
  return str.replace(/&#(\\d+);/g, function(match, dec) {
    return String.fromCharCode(dec);
  });
};
try {
  return decodeHtmlEntity(enc);
} catch (e) {
  return null;
}
""";

SELECT HTML_DECODE(title)
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
AND timestamp < '2019-04-25'
AND score >= 5
ORDER BY timestamp

The file was exported as a CSV, uploaded to a GCP VM w/ a P100 GPU (~120 seconds per 100 training steps), then converted to a gpt-2-simple-friendly TXT file via gpt2.encode_csv().
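
For reference, a minimal sketch of that conversion step using the gpt-2-simple Python API (the CSV filename is a placeholder; the repo does not specify it):

import gpt_2_simple as gpt2

# Wrap each title in <|startoftext|> ... <|endoftext|> tokens so the
# model learns where each document begins and ends.
gpt2.encode_csv("hn_titles.csv", out_path="csv_encoded.txt")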

The training was initiated with the CLI command gpt_2_simple finetune csv_encoded.txt, and the files were generated with the CLI command gpt_2_simple generate --temperature XXX --nsamples 1000 --batch_size 25 --length 100 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 10. The generated files were then downloaded locally.
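
The same steps can also be run from Python rather than the shell; a rough sketch (the model name, step count, and temperature here are illustrative, with argument names following the gpt-2-simple Python API):

import gpt_2_simple as gpt2

# Download the base GPT-2 model if it is not already present.
gpt2.download_gpt2(model_name="124M")

sess = gpt2.start_tf_sess()

# Finetune GPT-2 on the encoded titles.
gpt2.finetune(sess, "csv_encoded.txt", model_name="124M", steps=36813)

# Write 1,000 generated titles to a file, stripping the start token
# and truncating each sample at the end-of-text token.
gpt2.generate_to_file(sess,
                      destination_path="gentext_temp_1_0.txt",
                      temperature=1.0,
                      nsamples=1000,
                      batch_size=25,
                      length=100,
                      prefix="<|startoftext|>",
                      truncate="<|endoftext|>",
                      include_prefix=False)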

Maintainer/Creator

Max Woolf (@minimaxir)

License

MIT
