Hi! I'm one of the programmers at Gutenberg.
We've been improving the site a lot over the past few months (and more is coming!).
If you haven't visited the page recently, it's worth checking out again: https://www.gutenberg.org/
Thanks for the free work! Project Gutenberg is nice to have :).
On the site I noticed the library boxes have roughly a single extra line, causing a scrollbar to appear and the last line to be chopped off: https://i.imgur.com/PQ8T0qc.png. Is there an issues/bug portal to properly submit these kinds of things?
When I thought about Project Gutenberg I remembered that original brutalist non-design. The current site has been very tastefully updated but looks like it's still very accessible if you turn styles off. Great job!
I uploaded a PDF to archive.org that auto-OCRs with plenty of mistakes. I have found no way of updating the entire stack of documents produced. I wonder if Project Gutenberg is similar
While PG has probably gotten a lot of use and growth with the growth/mainstreaming of the Internet since the 1990s, (TIL) it started back in 1971:
> Michael S. Hart began Project Gutenberg in 1971 with the digitization of the United States Declaration of Independence.[5] Hart, a student at the University of Illinois, obtained access to a Xerox Sigma V mainframe computer in the university's Materials Research Lab. […] This computer was one of the 15 nodes on ARPANET, the computer network that would become the Internet. Hart believed one day the general public would be able to access computers and decided to make works of literature available in electronic form for free. […]
"Project Gutenberg began in 1971 when Michael Hart was given an operator’s account with $100,000,000 of computer time in it by the operators of the Xerox Sigma V mainframe at the Materials Research Lab at the University of Illinois."
Project Gutenberg is a treasure trove, though many technical details defy automatic typesetting of its books. Standard Ebooks takes consistency to an unbelievable level. My post compares various sources of public domain books with an eye on typesetting:
I'm surprised no eBook Reader vendor has a Project Gutenberg "Store." Where you can just browse Gutenberg, find a book, and just grab it down to the reader. Instead, they either are actively hostile (Kindle), or require the use of Calibre (which itself is good, it is just the friction).
To be precise, the vast majority of SE is from Gutenberg, but we also source from Faded Page, Gutenberg Australia, Wikisource and occasionally do our own transcriptions.
e-book app Gutebooks (in addition to their audio app), but it seems to have been deprecated (I'm no longer able to connect to the server on my copy, which I only got 'cause there was an in-app purchase to fund Project Librivox).
FWIW, Barnes & Noble has been plundering the public domain using a book composition/keying house in the Philippines to make their public domain books which they make available in their stores --- Amazon apparently has a similar setup for the Kindle Store:
>Barnes & Noble has been plundering the public domain using a book composition/keying house in the Philippines to make their public domain books which they make available in their stores
Why is it 'plundering' for B&N to print physical books and transport them to their brick-and-mortar stores to sell? There are real costs associated with doing so. It would not have zero cost for me to print and bind a copy myself at home.
the way I see it PG is a labor of love. Bit odd if Barnes & Noble or whoever piggyback off it. But in the end - the more people read the books, the better.
It is a public good, and it would be apropos if corporations would support it directly rather than work at cross-purposes to it.
If Amazon is going to sell public domain texts, then it would make sense to source them from PG and send some of the money from those sales to the non-profit. Similarly, they could then funnel reports of typos to PG for review and correction (it was a bit of a struggle the last time I tried to get a text corrected, and the project founder/director actually stepped in on my behalf).
Nice to see so much appreciation for what we do. (I'm the new-ish executive director.) Any wikipedians reading this, the article about PG is... aging. Last I looked, it said we offered Plucker files. @Jseiko has done some nice work.
Worth mentioning the Project Gutenberg ZIMs. You can download the entire English Gutenberg corpus for about 60GB (English Wikipedia ZIM complete with images is ~120GB):
Looks like the top downloaded book yesterday[0] was Concrete Construction: Methods and Costs by Gillette and Hill.[1] Beat out Moby Dick, Count of Monte Cristo, Frankenstein, Romeo and Juliet, and others.
> 23644 downloads in the last 30 days.
I wonder if this is bot behavior? 23k downloads feels like a lot?
Project Gutenberg had (has?) a tendency toward plaintext that always put me off. (And it has been over a decade I'm sure since I explored the site—so I am no doubt now misinformed.)
I like a styled formatted book—would prefer PDFs. (I know, not a popular format apparently.)
I like the idea of Project Gutenberg but guess I found book scans on archive.org my preference.
My go-to example is Lewis Carroll's "Through the Looking Glass" with the fantastic art of John Tenniel and Carroll's sometimes creative formatting of the prose…
I see they (Project Gutenberg) have ePub now, which can be good if well done.
(If not well done it can be a kind of mess. Re-flowable "HTML", paginated… Anyone ever try to print a long web page and did you enjoy the result? Perhaps that is as much on the ePub reader though.)
We're supporting EPUB3 for the vast majority of books! At the same time we also have a "Plain Text" version for each, as in a sense it's the most robust. PDFs are in the works!
As others here have mentioned, https://standardebooks.org/ is excellent and my understanding is that they use Gutenberg books as a source for theirs but done up much nicer.
As a Kindle user, I still miss the old version of the site. The new one looks great on normal desktop, but the old one was simple enough to load and directly download books on the device's built-in browser.
I remember printing out Project Gutenberg books in the mid-90s, four regular pages to an A4 page, double-sided on my inkjet. I had a background in typography, so I made it work.
And yes, the text needed a lot of processing to make it right.
In my early fifties and with declining eyesight, that's out of reach now.
that's cool! one of my "pet-ideas" is actually to make an AI-agent that does all that typographical work for any PG book to make it nicely printable without any manual labor whatsoever. Maybe that's doable now ...
That is doable. Most of my work was regexp and repetitive stuff. And the typography stuff is achievable with the current state of the art models. Not that I remember what I did, it was 30 years ago.
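For anyone curious what the "regexp and repetitive stuff" might look like, here is a minimal sketch. It is not the original workflow (which nobody remembers); the function name `tidy` and the specific rules are assumptions chosen for illustration: join hard-wrapped lines, turn `--` into a dash, and curl the quotes. The apostrophe rule is deliberately naive and would mangle single-quoted dialogue.

```python
import re

def tidy(text):
    """Naive typographic cleanup for a plain-text book excerpt (illustrative only)."""
    # Blank lines separate paragraphs; hard line wraps inside one are joined.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text.strip())]
    cleaned = []
    for p in paragraphs:
        # Double hyphen (with optional surrounding spaces) becomes an em dash.
        p = re.sub(r"\s*--\s*", "\u2014", p)
        # A straight quote at the start, or after whitespace, "(" or a dash, opens.
        p = re.sub(r'(^|[\s(\u2014])"', r"\1" + "\u201c", p)
        # Any straight double quote still left is treated as a closing quote.
        p = p.replace('"', "\u201d")
        # Assumption: every straight single quote is an apostrophe.
        p = p.replace("'", "\u2019")
        cleaned.append(p)
    return "\n\n".join(cleaned)
```

Calling `tidy(open("book.txt", encoding="utf-8").read())` on a downloaded plain-text file (the file name is hypothetical) would give a first pass; real books need many more special cases.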
I'm slightly curious how PG handles heavily illustrated books. I've downloaded some years ago, and the quality of the illustrations was always pretty poor. Has it been improved lately? What's the QA like for illustrations?
Nowadays we depend on scans from Internet Archive, Hathitrust, and other sources. Some scans are better than others. Bear in mind that our illustrations need to be in the public domain and usually from the same edition as the text. https://www.gutenberg.org/help/errata.html
Not a recommendation per se but I used to use Amphetype on Gutenberg texts to practise touch-typing. There's something about writing out a book that hits differently to reading it. You skip less, odd parts stick with you.
I think the last one I tried was The Island of Dr Moreau.
just heard back that the server provider has been doing a security update. Maybe you were one of the users that got unlucky as a result... maybe try later if still interested
Every day you'll get much more than you're bargaining for, right into your feed or inbox. Easy download books you're interested in and put them on your Kindle.
good question. first thought - maybe some bot has downloaded it often for whatever reasons and our systems didn't detect it as bot traffic. just a guess.
I've found that the larger open-weight AI models do a great job of explaining the old non-fiction content on PG, particularly magazine articles which are a good size for the AI to handle. It breaks down the long wall-of-text paragraphs for you and explains all the historically relevant background that would've been assumed to be known back in the day.
If you ask it to assess the relevance of the text in the present day it will also do that very nicely, highlighting the places where the text shows old-fashioned viewpoints that would be sharply criticized today.
so maybe Karpathy has a point that LLM-assisted reading should be a thing. Would be cool if that worked on E-Reader screens as well. Maybe when the browsers on E-Readers become good enough ...
Keep up the good work!
autocat3 and gutenbergsite are repos responsible for generating gutenberg.org
(I can’t quite tell if that’s an egregious abuse of the site or if you’re perfectly fine to share without human eyeballs hitting your www?)
https://www.gutenberg.org/ebooks/offline_catalogs.html
Perhaps you can find the information you are looking for there.
However, if you plan on scraping or otherwise hitting them with a ton of traffic, at least consider donating a good amount for the traffic you cause them. It ain't free after all.
Don't hit the site with an agent. The section furthest down that page is the machine-readable one.
> All Project Gutenberg metadata are available digitally in the XML/RDF format. This is updated daily (other than the legacy format mentioned below). Please use one of these files as input to a database or other tools you may be developing, instead of crawling or roboting the website.
And strongly consider a donation! (My addition)
https://www.gutenberg.org/ebooks/offline_catalogs.html#the-p...
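To make the "use the catalog files instead of crawling" advice concrete, here is a rough sketch of reading one catalog record with Python's standard library. The inline `SAMPLE_RDF` is a hand-written, simplified stand-in, and the element names and namespace URIs are my assumptions about the RDF layout, so check them against a real file from the offline catalogs page before relying on this.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the catalog's RDF/XML; verify against a real record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written, simplified sample record (not copied from the real catalog).
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1513">
    <dcterms:title>Romeo and Juliet</dcterms:title>
  </pgterms:ebook>
</rdf:RDF>
"""

def read_ebook(rdf_text):
    """Return the ebook id and title from one RDF record, or None if absent."""
    root = ET.fromstring(rdf_text)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    # rdf:about holds a path like "ebooks/1513"; keep only the trailing number.
    about = ebook.get("{%s}about" % NS["rdf"], "")
    title = ebook.findtext("dcterms:title", default="", namespaces=NS)
    return {"id": about.rsplit("/", 1)[-1], "title": title}
```

Run over a locally downloaded copy of the catalog, something like this lets you build your own index without sending a single request to the live site.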
* https://en.wikipedia.org/wiki/Project_Gutenberg
https://www.gutenberg.org/about/background/history_and_philo...
https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-p...
Technically, I can also just directly pull the epub from Project Gutenberg, but sometimes the formatting leaves a lot to be desired.
Once you get an e-reader that runs a semi-capable OS (ex - stock android, even an older version), it's hard to go back to something like a kindle.
https://www.gutenberg.org/cache/epub/1513/pg1513-images.html
https://standardebooks.org/ebooks/william-shakespeare/romeo-...
Each has its particular advantages relative to the other ...
Also one should probably compare the former to the single-page version on standardebooks: https://standardebooks.org/ebooks/william-shakespeare/romeo-...
https://librivox.org/
https://www.amazon.com/Public-Domain-Books-Kindle-Store/s?k=...
Rather a shame that PG didn't monetize by putting their books up there pre-emptively.
https://play.google.com/store/apps/details?id=biz.bookdesign...
should be opensource --- it does at least work to support Project Librivox (or at least that's my understanding)
but yes, generally I agree with your point. Library of 75k books seems pretty valuable to have direct access to.
https://ebookfoundation.org/openzim.html
[0] https://www.gutenberg.org/browse/scores/top [1] https://www.gutenberg.org/ebooks/24855
https://www.fadedpage.com/ from Canada I think
https://runeberg.org/ from Sweden
The previous version of the site had two major flaws:
1. The search bar had been removed from the top of the page, and hidden behind a "Click here to search" (or similar) link partway down the page
2. Once you opened that page, the coloring of the site was so washed out on e-ink that the text input was hard to find.
Thanks for fixing it!
You can download books in most browsers. I know Amazon have done things to make life difficult for other stores in the past.
Thanks for sticking with the project!
I've heard good things. Also - Sherlock Holmes :)
https://www.gutenberg.org/ebooks/feeds.html
https://onlinebooks.library.upenn.edu/new.html
could be a trick to ease that fear :D