Compare commits

..

1 Commits
0.4.1 ... 0.3.2

Author SHA1 Message Date
Daniel Quinn
717b4a2b49 Fixes #172
Introduce some creative code around setting of ALLOWED_HOSTS that defaults to ['*'].  Also added PAPERLESS_ALLOWED_HOSTS to paperless.conf.example with an explanation as to what it's for
2017-01-03 09:52:31 +00:00
49 changed files with 352 additions and 1180 deletions

View File

@@ -8,9 +8,7 @@ matrix:
env: TOXENV=py34
- python: 3.5
env: TOXENV=py35
- python: 3.6
env: TOXENV=py36
- python: 3.6
- python: 3.5
env: TOXENV=pep8
install:

View File

@@ -1,58 +1,9 @@
Changelog
#########
* 0.4.1
* Fix for `#206`_ wherein the pluggable parser didn't recognise files with
all-caps suffixes like ``.PDF``
* 0.4.0
* Introducing reminders. See `#199`_ for more information, but the short
explanation is that you can now attach simple notes & times to documents
which are made available via the API. Currently, the default API
(basically just the Django admin) doesn't really make use of this, but
`Thomas Brueggemann`_ over at `Paperless Desktop`_ has said that he would
like to make use of this feature in his project.
* 0.3.6
* Fix for `#200`_ (!!) where the API wasn't configured to allow updating the
correspondent or the tags for a document.
* The ``content`` field is now optional, to allow for the edge case of a
purely graphical document.
* You can no longer add documents via the admin. This never worked in the
first place, so all I've done here is remove the link to the broken form.
* The consumer code has been heavily refactored to support a pluggable
interface. Install a paperless consumer via pip and tell paperless about
it with an environment variable, and you're good to go. Proper
documentation is on its way.
* 0.3.5
* A serious facelift for the documents listing page wherein we drop the
tabular layout in favour of a tiled interface.
* Users can now configure the number of items per page.
* Fix for `#171`_: Allow users to specify their own ``SECRET_KEY`` value.
* Moved the dotenv loading to the top of settings.py
* Fix for `#112`_: Added checks for binaries required for document
consumption.
* 0.3.4
* Removal of django-suit due to a licensing conflict I bumped into in 0.3.3.
Note that you *can* use Django Suit with Paperless, but only in a
non-profit situation as their free license prohibits for-profit use. As a
result, I can't bundle Suit with Paperless without conflicting with the
GPL. Further development will be done against the stock Django admin.
* I shrunk the thumbnails a little 'cause they were too big for me, even on
my high-DPI monitor.
* BasicAuth support for document and thumbnail downloads, as well as the Push
API thanks to @thomasbrueggemann. See `#179`_.
* 0.3.3
* Thumbnails in the UI and a Django-suit -based face-lift courtesy of @ekw!
* Timezone, items per page, and default language are now all configurable,
also thanks to @ekw.
* 0.3.2
* Fix for `#172`_: defaulting ALLOWED_HOSTS to ``["*"]`` and allowing the
user to set her own value via ``PAPERLESS_ALLOWED_HOSTS`` should the need
* Fix for #172: defaulting ALLOWED_HOSTS to ``["*"]`` and allowing the user
to set her own value via ``PAPERLESS_ALLOWED_HOSTS`` should the need
arise.
* 0.3.1
@@ -75,8 +26,7 @@ Changelog
``paperless.conf``.
* `#148`_: The database location (sqlite) is now a variable you can set in
``paperless.conf``.
* `#146`_: Fixed a bug that allowed unauthorised access to the ``/fetch``
URL.
* `#146`_: Fixed a bug that allowed unauthorised access to the `/fetch` URL.
* `#131`_: Document files are now automatically removed from disk when
they're deleted in Paperless.
* `#121`_: Fixed a bug where Paperless wasn't setting document creation time
@@ -185,8 +135,6 @@ Changelog
.. _Tim White: https://github.com/timwhite
.. _Florian Harr: https://github.com/evils
.. _Justin Snyman: https://github.com/stringlytyped
.. _Thomas Brueggemann: https://github.com/thomasbrueggemann
.. _Paperless Desktop: https://github.com/thomasbrueggemann/paperless-desktop
.. _#20: https://github.com/danielquinn/paperless/issues/20
.. _#44: https://github.com/danielquinn/paperless/issues/44
@@ -204,15 +152,8 @@ Changelog
.. _#89: https://github.com/danielquinn/paperless/issues/89
.. _#94: https://github.com/danielquinn/paperless/issues/94
.. _#98: https://github.com/danielquinn/paperless/issues/98
.. _#112: https://github.com/danielquinn/paperless/issues/112
.. _#121: https://github.com/danielquinn/paperless/issues/121
.. _#131: https://github.com/danielquinn/paperless/issues/131
.. _#146: https://github.com/danielquinn/paperless/issues/146
.. _#148: https://github.com/danielquinn/paperless/pull/148
.. _#150: https://github.com/danielquinn/paperless/pull/150
.. _#171: https://github.com/danielquinn/paperless/issues/171
.. _#172: https://github.com/danielquinn/paperless/issues/172
.. _#179: https://github.com/danielquinn/paperless/pull/179
.. _#199: https://github.com/danielquinn/paperless/issues/199
.. _#200: https://github.com/danielquinn/paperless/issues/200
.. _#206: https://github.com/danielquinn/paperless/issues/206

View File

@@ -65,8 +65,8 @@ PAPERLESS_SHARED_SECRET=""
# cases it has proven useful to configure a lesser value.
# This setting has a high impact on the physical size of tmp page files,
# the speed of document conversion, and can affect the accuracy of OCR
# results. Individual results can vary and this setting should be tested
# thoroughly against the documents you are importing to see if it has any
# results. Individual results can vary and this setting should be tested
# thoroughly against the documents you are importing to see if it has any
# impacts either negative or positive. Testing on limited document sets has
# shown a setting of 200 can cut the size of tmp files by 1/3, and speed up
# conversion by up to 4x with little impact to OCR accuracy.
@@ -81,17 +81,13 @@ PAPERLESS_SHARED_SECRET=""
# the web for "MAGICK_TMPDIR".
#PAPERLESS_CONVERT_TMPDIR=/var/tmp/paperless
# You can specify where you want the SQLite database to be stored instead of
# You can specify where you want the SQLite database to be stored instead of
# the default location
#PAPERLESS_DBDIR=/path/to/database/file
# Override the default MEDIA_ROOT here. This is where all files are stored.
#PAPERLESS_MEDIADIR=/path/to/media
# Override the default STATIC_ROOT here. This is where all static files created
# using "collectstatic" manager command are stored.
#PAPERLESS_STATICDIR=""
# The number of seconds that Paperless will wait between checking
# PAPERLESS_CONSUMPTION_DIR. If you tend to write documents to this directory
# very slowly, you may want to use a higher value than the default (10).
@@ -99,29 +95,8 @@ PAPERLESS_SHARED_SECRET=""
# If you're planning on putting Paperless on the open internet, then you
# really should set this value to the domain name you're using. Failing to do
# so leaves you open to HTTP host header attacks:
# https://docs.djangoproject.com/en/1.10/topics/security/#host-headers-virtual-hosting
#
# so leaves you open to XSS attacks.
# Just remember that this is a comma-separated list, so "example.com" is fine,
# as is "example.com,www.example.com", but NOT " example.com" or "example.com,"
#PAPERLESS_ALLOWED_HOSTS="example.com,www.example.com"
# Override the default UTC time zone here
#PAPERLESS_TIME_ZONE=UTC
# Customize number of list items to show per page
#PAPERLESS_LIST_PER_PAGE=50
# Customize the default language that tesseract will attempt to use when parsing
# documents. It should be a 3-letter language code consistent with ISO 639.
#PAPERLESS_OCR_LANGUAGE=eng
# The number of items on each page in the web UI. This value must be a
# positive integer, but if you don't define one in paperless.conf, a default of
# 100 will be used.
#PAPERLESS_LIST_PER_PAGE=100
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
# public, you should change the key to something unique and verbose.
#PAPERLESS_SECRET_KEY="change-me"

View File

@@ -1,17 +1,16 @@
Django==1.10.5
Django==1.10.4
Pillow>=3.1.1
django-crispy-forms>=1.6.1
django-extensions>=1.7.6
django-crispy-forms>=1.6.0
django-extensions>=1.6.1
django-filter>=1.0
django-flat-responsive>=1.2.0
djangorestframework>=3.5.3
djangorestframework>=3.4.4
filemagic>=1.6
langdetect>=1.0.7
pyocr>=0.4.6
python-dateutil>=2.6.0
python-dotenv>=0.6.2
python-gnupg>=0.3.9
pytz>=2016.10
langdetect>=1.0.5
pyocr>=0.3.1
python-dateutil>=2.4.2
python-dotenv>=0.3.0
python-gnupg>=0.3.8
pytz>=2015.7
gunicorn==19.6.0
# For the tests

View File

@@ -1,4 +1,3 @@
from django.conf import settings
from django.contrib import admin
from django.contrib.auth.models import User, Group
from django.core.urlresolvers import reverse
@@ -32,25 +31,21 @@ class MonthListFilter(admin.SimpleListFilter):
return queryset.filter(created__year=year, created__month=month)
class CommonAdmin(admin.ModelAdmin):
list_per_page = settings.PAPERLESS_LIST_PER_PAGE
class CorrespondentAdmin(CommonAdmin):
class CorrespondentAdmin(admin.ModelAdmin):
list_display = ("name", "match", "matching_algorithm")
list_filter = ("matching_algorithm",)
list_editable = ("match", "matching_algorithm")
class TagAdmin(CommonAdmin):
class TagAdmin(admin.ModelAdmin):
list_display = ("name", "colour", "match", "matching_algorithm")
list_filter = ("colour", "matching_algorithm")
list_editable = ("colour", "match", "matching_algorithm")
class DocumentAdmin(CommonAdmin):
class DocumentAdmin(admin.ModelAdmin):
class Media:
css = {
@@ -58,27 +53,12 @@ class DocumentAdmin(CommonAdmin):
}
search_fields = ("correspondent__name", "title", "content")
list_display = ("title", "created", "thumbnail", "correspondent", "tags_")
list_display = ("created", "correspondent", "title", "tags_", "document")
list_filter = ("tags", "correspondent", MonthListFilter)
ordering = ["-created", "correspondent"]
def has_add_permission(self, request):
return False
list_per_page = 25
def created_(self, obj):
return obj.created.date().strftime("%Y-%m-%d")
created_.short_description = "Created"
def thumbnail(self, obj):
png_img = self._html_tag(
"img",
src="/fetch/thumb/{}".format(obj.id),
width=180,
alt="Thumbnail of {}".format(obj.file_name),
title=obj.file_name
)
return self._html_tag("a", png_img, href=obj.download_url)
thumbnail.allow_tags = True
def tags_(self, obj):
r = ""
@@ -128,7 +108,7 @@ class DocumentAdmin(CommonAdmin):
return "<{} {}/>".format(kind, " ".join(attributes))
class LogAdmin(CommonAdmin):
class LogAdmin(admin.ModelAdmin):
list_display = ("created", "message", "level",)
list_filter = ("level", "created",)

View File

@@ -1,21 +1,35 @@
import datetime
import hashlib
import logging
import os
import re
import uuid
import shutil
import hashlib
import logging
import datetime
import tempfile
import itertools
import subprocess
from multiprocessing.pool import Pool
import pyocr
import langdetect
from PIL import Image
from django.conf import settings
from django.utils import timezone
from paperless.db import GnuPG
from pyocr.tesseract import TesseractError
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from .models import Document, FileInfo, Tag
from .parsers import ParseError
from .models import Tag, Document, FileInfo
from .signals import (
document_consumer_declaration,
document_consumption_finished,
document_consumption_started
document_consumption_started,
document_consumption_finished
)
from .languages import ISO639
class OCRError(Exception):
pass
class ConsumerError(Exception):
@@ -33,7 +47,13 @@ class Consumer(object):
"""
SCRATCH = settings.SCRATCH_DIR
CONVERT = settings.CONVERT_BINARY
UNPAPER = settings.UNPAPER_BINARY
CONSUME = settings.CONSUMPTION_DIR
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
def __init__(self):
@@ -58,16 +78,6 @@ class Consumer(object):
raise ConsumerError(
"Consumption directory {} does not exist".format(self.CONSUME))
self.parsers = []
for response in document_consumer_declaration.send(self):
self.parsers.append(response[1])
if not self.parsers:
raise ConsumerError(
"No parsers could be found, not even the default. "
"This is a problem."
)
def log(self, level, message):
getattr(self.logger, level)(message, extra={
"group": self.logging_group
@@ -99,13 +109,6 @@ class Consumer(object):
self._ignore.append(doc)
continue
parser_class = self._get_parser_class(doc)
if not parser_class:
self.log(
"error", "No parsers could be found for {}".format(doc))
self._ignore.append(doc)
continue
self.logging_group = uuid.uuid4()
self.log("info", "Consuming {}".format(doc))
@@ -116,26 +119,25 @@ class Consumer(object):
logging_group=self.logging_group
)
parsed_document = parser_class(doc)
thumbnail = parsed_document.get_thumbnail()
tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
imgs = self._get_greyscale(tempdir, doc)
thumbnail = self._get_thumbnail(tempdir, doc)
try:
document = self._store(
parsed_document.get_text(),
doc,
thumbnail
)
except ParseError as e:
document = self._store(self._get_ocr(imgs), doc, thumbnail)
except OCRError as e:
self._ignore.append(doc)
self.log("error", "PARSE FAILURE for {}: {}".format(doc, e))
parsed_document.cleanup()
self.log("error", "OCR FAILURE for {}: {}".format(doc, e))
self._cleanup_tempdir(tempdir)
continue
else:
parsed_document.cleanup()
self._cleanup_tempdir(tempdir)
self._cleanup_doc(doc)
self.log(
@@ -149,30 +151,142 @@ class Consumer(object):
logging_group=self.logging_group
)
def _get_parser_class(self, doc):
def _get_greyscale(self, tempdir, doc):
"""
Determine the appropriate parser class based on the file
Greyscale images are easier for Tesseract to OCR
"""
options = []
for parser in self.parsers:
result = parser(doc)
if result:
options.append(result)
self.log("info", "Generating greyscale image from {}".format(doc))
self.log(
"info",
"Parsers available: {}".format(
", ".join([str(o["parser"].__name__) for o in options])
)
# Convert PDF to multiple PNMs
pnm = os.path.join(tempdir, "convert-%04d.pnm")
run_convert(
self.CONVERT,
"-density", str(self.DENSITY),
"-depth", "8",
"-type", "grayscale",
doc, pnm,
)
if not options:
return None
# Get a list of converted images
pnms = []
for f in os.listdir(tempdir):
if f.endswith(".pnm"):
pnms.append(os.path.join(tempdir, f))
# Return the parser with the highest weight.
return sorted(
options, key=lambda _: _["weight"], reverse=True)[0]["parser"]
# Run unpaper in parallel on converted images
with Pool(processes=self.THREADS) as pool:
pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
# Return list of converted images, processed with unpaper
pnms = []
for f in os.listdir(tempdir):
if f.endswith(".unpaper.pnm"):
pnms.append(os.path.join(tempdir, f))
return sorted(filter(lambda __: os.path.isfile(__), pnms))
def _get_thumbnail(self, tempdir, doc):
"""
The thumbnail of a PDF is just a 500px wide image of the first page.
"""
self.log("info", "Generating the thumbnail")
run_convert(
self.CONVERT,
"-scale", "500x5000",
"-alpha", "remove",
doc, os.path.join(tempdir, "convert-%04d.png")
)
return os.path.join(tempdir, "convert-0000.png")
def _guess_language(self, text):
try:
guess = langdetect.detect(text)
self.log("debug", "Language detected: {}".format(guess))
return guess
except Exception as e:
self.log("warning", "Language detection error: {}".format(e))
def _get_ocr(self, imgs):
"""
Attempts to do the best job possible OCR'ing the document based on
simple language detection trial & error.
"""
if not imgs:
raise OCRError("No images found")
self.log("info", "OCRing the document")
# Since the division gets rounded down by int, this calculation works
# for every edge-case, i.e. 1
middle = int(len(imgs) / 2)
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
guessed_language = self._guess_language(raw_text)
if not guessed_language or guessed_language not in ISO639:
self.log("warning", "Language detection failed!")
if settings.FORGIVING_OCR:
self.log(
"warning",
"As FORGIVING_OCR is enabled, we're going to make the "
"best with what we have."
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError("Language detection failed")
if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
try:
return self._ocr(imgs, ISO639[guessed_language])
except pyocr.pyocr.tesseract.TesseractError:
if settings.FORGIVING_OCR:
self.log(
"warning",
"OCR for {} failed, but we're going to stick with what "
"we've got since FORGIVING_OCR is enabled.".format(
guessed_language
)
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError(
"The guessed language is not available in this instance of "
"Tesseract."
)
def _assemble_ocr_sections(self, imgs, middle, text):
"""
Given a `middle` value and the text that middle page represents, we OCR
the remainder of the document and return the whole thing.
"""
text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
return text
def _ocr(self, imgs, lang):
"""
Performs a single OCR attempt.
"""
if not imgs:
return ""
self.log("info", "Parsing for {}".format(lang))
with Pool(processes=self.THREADS) as pool:
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
r = " ".join(r)
# Strip out excess white space to allow matching to go smoother
return strip_excess_whitespace(r)
def _store(self, text, doc, thumbnail):
@@ -218,6 +332,10 @@ class Consumer(object):
return document
def _cleanup_tempdir(self, d):
self.log("debug", "Deleting directory {}".format(d))
shutil.rmtree(d)
def _cleanup_doc(self, doc):
self.log("debug", "Deleting document {}".format(doc))
os.unlink(doc)
@@ -243,3 +361,41 @@ class Consumer(object):
with open(doc, "rb") as f:
checksum = hashlib.md5(f.read()).hexdigest()
return Document.objects.filter(checksum=checksum).exists()
def strip_excess_whitespace(text):
collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
no_leading_whitespace = re.sub(
"([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
return no_trailing_whitespace
def image_to_string(args):
img, lang = args
ocr = pyocr.get_available_tools()[0]
with Image.open(os.path.join(Consumer.SCRATCH, img)) as f:
if ocr.can_detect_orientation():
try:
orientation = ocr.detect_orientation(f, lang=lang)
f = f.rotate(orientation["angle"], expand=1)
except (TesseractError, OtherTesseractError):
pass
return ocr.image_to_string(f, lang=lang)
def run_unpaper(args):
unpaper, pnm = args
subprocess.Popen(
(unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
def run_convert(*args):
environment = os.environ.copy()
if settings.CONVERT_MEMORY_LIMIT:
environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
if settings.CONVERT_TMPDIR:
environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
subprocess.Popen(args, env=environment).wait()

View File

@@ -8,7 +8,7 @@ class CorrespondentFilterSet(FilterSet):
class Meta(object):
model = Correspondent
fields = {
"name": [
'name': [
"startswith", "endswith", "contains",
"istartswith", "iendswith", "icontains"
],
@@ -21,7 +21,7 @@ class TagFilterSet(FilterSet):
class Meta(object):
model = Tag
fields = {
"name": [
'name': [
"startswith", "endswith", "contains",
"istartswith", "iendswith", "icontains"
],

View File

@@ -3,7 +3,6 @@
from __future__ import unicode_literals
from django.db import migrations, models
from django.conf import settings
class Migration(migrations.Migration):
@@ -20,7 +19,7 @@ class Migration(migrations.Migration):
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('sender', models.CharField(blank=True, db_index=True, max_length=128)),
('title', models.CharField(blank=True, db_index=True, max_length=128)),
('content', models.TextField(db_index=("mysql" not in settings.DATABASES["default"]["ENGINE"]))),
('content', models.TextField(db_index=True)),
('created', models.DateTimeField(auto_now_add=True)),
('modified', models.DateTimeField(auto_now=True)),
],

View File

@@ -47,11 +47,7 @@ class Migration(migrations.Migration):
],
),
migrations.RunPython(move_sender_strings_to_sender_model),
migrations.RemoveField(
model_name='document',
name='sender',
),
migrations.AddField(
migrations.AlterField(
model_name='document',
name='sender',
field=models.ForeignKey(blank=True, on_delete=django.db.models.deletion.CASCADE, to='documents.Sender'),

View File

@@ -1,20 +0,0 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-03-25 15:58
from __future__ import unicode_literals
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('documents', '0015_add_insensitive_to_match'),
]
operations = [
migrations.AlterField(
model_name='document',
name='content',
field=models.TextField(blank=True, db_index=True, help_text='The raw, text-only data of the document. This field is primarily used for searching.'),
),
]

View File

@@ -158,22 +158,13 @@ class Document(models.Model):
correspondent = models.ForeignKey(
Correspondent, blank=True, null=True, related_name="documents")
title = models.CharField(max_length=128, blank=True, db_index=True)
content = models.TextField(
db_index=True,
blank=True,
help_text="The raw, text-only data of the document. This field is "
"primarily used for searching."
)
content = models.TextField(db_index=True)
file_type = models.CharField(
max_length=4,
editable=False,
choices=tuple([(t, t.upper()) for t in TYPES])
)
tags = models.ManyToManyField(
Tag, related_name="documents", blank=True)

View File

@@ -1,45 +0,0 @@
import logging
import shutil
import tempfile
from django.conf import settings
class ParseError(Exception):
pass
class DocumentParser(object):
"""
Subclass this to make your own parser. Have a look at
`paperless_tesseract.parsers` for inspiration.
"""
SCRATCH = settings.SCRATCH_DIR
def __init__(self, path):
self.document_path = path
self.tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
self.logger = logging.getLogger(__name__)
self.logging_group = None
def get_thumbnail(self):
"""
Returns the path to a file we can use as a thumbnail for this document.
"""
raise NotImplementedError()
def get_text(self):
"""
Returns the text from the document and only the text.
"""
raise NotImplementedError()
def log(self, level, message):
getattr(self.logger, level)(message, extra={
"group": self.logging_group
})
def cleanup(self):
self.log("debug", "Deleting directory {}".format(self.tempdir))
shutil.rmtree(self.tempdir)

View File

@@ -18,21 +18,12 @@ class TagSerializer(serializers.HyperlinkedModelSerializer):
"id", "slug", "name", "colour", "match", "matching_algorithm")
class CorrespondentField(serializers.HyperlinkedRelatedField):
def get_queryset(self):
return Correspondent.objects.all()
class TagsField(serializers.HyperlinkedRelatedField):
def get_queryset(self):
return Tag.objects.all()
class DocumentSerializer(serializers.ModelSerializer):
correspondent = CorrespondentField(
view_name="drf:correspondent-detail", allow_null=True)
tags = TagsField(view_name="drf:tag-detail", many=True)
correspondent = serializers.HyperlinkedRelatedField(
read_only=True, view_name="drf:correspondent-detail", allow_null=True)
tags = serializers.HyperlinkedRelatedField(
read_only=True, view_name="drf:tag-detail", many=True)
class Meta(object):
model = Document

View File

@@ -2,4 +2,3 @@ from django.dispatch import Signal
document_consumption_started = Signal(providing_args=["filename"])
document_consumption_finished = Signal(providing_args=["document"])
document_consumer_declaration = Signal(providing_args=[])

View File

@@ -1,5 +1,6 @@
import logging
import os
from subprocess import Popen
from django.conf import settings

View File

@@ -10,14 +10,3 @@ td a.tag {
margin: 1px;
display: inline-block;
}
#result_list th.column-note {
text-align: right;
}
#result_list td.field-note {
text-align: right;
}
#result_list td textarea {
width: 90%;
height: 5em;
}

View File

@@ -1,6 +0,0 @@
{% load hacks %}
{# See documents.templatetags.hacks.change_list_results for an explanation #}
{% change_list_results %}

View File

@@ -1,167 +0,0 @@
{% load i18n %}
<style>
.grid *, .grid *:after, .grid *:before {
-webkit-box-sizing: border-box;
-moz-box-sizing: border-box;
box-sizing: border-box;
}
.box {
width: 12.5%;
padding: 1em;
float: left;
opacity: 0.7;
transition: all 0.5s;
}
.box:hover {
opacity: 1;
transition: all 0.5s;
}
.box:last-of-type {
padding-right: 0;
}
.result {
border: 1px solid #cccccc;
border-radius: 2%;
overflow: hidden;
height: 300px;
}
.result .header {
padding: 5px;
background-color: #79AEC8;
height: 6em;
}
.result .header .checkbox {
margin-right: 5px;
}
.result .header .checkbox{
width: 5%;
float: left;
}
.result .header .info {
width: 90%;
float: left;
}
.result .header a,
.result a.tag {
color: #ffffff;
}
.result .date {
padding: 5px;
}
.result .tags {
float: left;
}
.result .tags a.tag {
padding: 2px 5px;
border-radius: 2px;
display: inline-block;
margin: 2px;
}
.result .date {
float: right;
color: #cccccc;
}
.result .image img {
width: 100%;
}
.grid {
margin-right: 260px;
}
.grid:after {
content: "";
display: table;
clear: both;
}
@media (max-width: 1600px) {
.box {
width: 25%
}
}
@media (max-width: 991px) {
.grid {
margin-right: 220px;
}
.box {
width: 50%
}
}
@media (max-width: 767px) {
.grid {
margin-right: 0;
}
}
@media (max-width: 500px) {
.box {
width: 100%
}
}
</style>
{# This is just copypasta from the parent change_list_results.html file #}
<table id="result_list">
<thead>
<tr>
{% for header in result_headers %}
<th scope="col" {{ header.class_attrib }}>
{% if header.sortable %}
{% if header.sort_priority > 0 %}
<div class="sortoptions">
<a class="sortremove" href="{{ header.url_remove }}" title="{% trans "Remove from sorting" %}"></a>
{% if num_sorted_fields > 1 %}<span class="sortpriority" title="{% blocktrans with priority_number=header.sort_priority %}Sorting priority: {{ priority_number }}{% endblocktrans %}">{{ header.sort_priority }}</span>{% endif %}
<a href="{{ header.url_toggle }}" class="toggle {% if header.ascending %}ascending{% else %}descending{% endif %}" title="{% trans "Toggle sorting" %}"></a>
</div>
{% endif %}
{% endif %}
<div class="text">{% if header.sortable %}<a href="{{ header.url_primary }}">{{ header.text|capfirst }}</a>{% else %}<span>{{ header.text|capfirst }}</span>{% endif %}</div>
<div class="clear"></div>
</th>{% endfor %}
</tr>
</thead>
</table>
{# /copypasta #}
<div class="grid">
{% for result in results %}
{# 0: Checkbox #}
{# 1: Title #}
{# 2: Date #}
{# 3: Image #}
{# 4: Correspondent #}
{# 5: Tags #}
<div class="box">
<div class="result">
<div class="header">
<div class="checkbox">{{ result.0 }}</div>
<div class="info">
{{ result.4 }}<br />
{{ result.1 }}
</div>
<div style="clear: both;"></div>
</div>
<div class="tags">{{ result.5 }}</div>
<div class="date">{{ result.2 }}</div>
<div style="clear: both;"></div>
<div class="image">{{ result.3 }}</div>
</div>
</div>
{% endfor %}
</div>
<script>
// We need to re-build the select-all functionality as the old logic pointed
// to a table and we're using divs now.
django.jQuery("#action-toggle").on("change", function(){
django.jQuery(".grid .box .result .checkbox input")
.prop("checked", this.checked);
});
</script>

View File

@@ -1,41 +0,0 @@
import os
from django.contrib import admin
from django.template import Library
from django.template.loader import get_template
from ..models import Document
register = Library()
@register.simple_tag(takes_context=True)
def change_list_results(context):
"""
Django has a lot of places where you can override defaults, but
unfortunately, `change_list_results.html` is not one of them. In fact,
it's a downright pain in the ass to override this file on a per-model basis
and this is the cleanest way I could come up with.
Basically all we've done here is defined `change_list_results.html` in an
`admin` directory which globally overrides that file for *every* model.
That template however simply loads this templatetag which determines
whether we're currently looking at a `Document` listing or something else
and loads the appropriate file in each case.
Better work arounds for this are welcome as I hate this myself, but at the
moment, it's all I could come up with.
"""
path = os.path.join(
os.path.dirname(admin.__file__),
"templates",
"admin",
"change_list_results.html"
)
if context["cl"].model == Document:
path = "admin/documents/document/change_list_results.html"
return get_template(path).render(context)

View File

Before

Width:  |  Height:  |  Size: 32 KiB

After

Width:  |  Height:  |  Size: 32 KiB

View File

@@ -1,54 +1,13 @@
import os
from unittest import mock, skipIf
import pyocr
from django.test import TestCase
from unittest import mock
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from ..consumer import Consumer
from ..models import FileInfo
class TestConsumer(TestCase):
class DummyParser(object):
pass
def test__get_parser_class_1_parser(self):
self.assertEqual(
self._get_consumer()._get_parser_class("doc.pdf"),
self.DummyParser
)
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def test__get_parser_class_n_parsers(self, m, *args):
class DummyParser1(object):
pass
class DummyParser2(object):
pass
m.return_value = (
(None, lambda _: {"weight": 0, "parser": DummyParser1}),
(None, lambda _: {"weight": 1, "parser": DummyParser2}),
)
self.assertEqual(Consumer()._get_parser_class("doc.pdf"), DummyParser2)
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def test__get_parser_class_0_parsers(self, m, *args):
m.return_value = ((None, lambda _: None),)
self.assertIsNone(Consumer()._get_parser_class("doc.pdf"))
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def _get_consumer(self, m, *args):
m.return_value = (
(None, lambda _: {"weight": 0, "parser": self.DummyParser}),
)
return Consumer()
from ..consumer import image_to_string, strip_excess_whitespace
class TestAttributes(TestCase):
@@ -349,3 +308,71 @@ class TestFieldPermutations(TestCase):
}
self._test_guessed_attributes(
template.format(**spec), **spec)
class FakeTesseract(object):
@staticmethod
def can_detect_orientation():
return True
@staticmethod
def detect_orientation(file_handle, lang):
raise OtherTesseractError("arbitrary status", "message")
@staticmethod
def image_to_string(file_handle, lang):
return "This is test text"
class FakePyOcr(object):
@staticmethod
def get_available_tools():
return [FakeTesseract]
class TestOCR(TestCase):
text_cases = [
("simple string", "simple string"),
(
"simple newline\n testing string",
"simple newline\ntesting string"
),
(
"utf-8 строка с пробелами в конце ",
"utf-8 строка с пробелами в конце"
)
]
SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
def test_strip_excess_whitespace(self):
for source, result in self.text_cases:
actual_result = strip_excess_whitespace(source)
self.assertEqual(
result,
actual_result,
"strip_exceess_whitespace({}) != '{}', but '{}'".format(
source,
result,
actual_result
)
)
@skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
@mock.patch("documents.consumer.Consumer.SCRATCH", SAMPLE_FILES)
@mock.patch("documents.consumer.pyocr", FakePyOcr)
def test_image_to_string_with_text_free_page(self):
"""
This test is sort of silly, since it's really just reproducing an odd
exception thrown by pyocr when it encounters a page with no text.
Actually running this test against an installation of Tesseract results
in a segmentation fault rooted somewhere deep inside pyocr where I
don't care to dig. Regardless, if you run the consumer normally,
text-free pages are now handled correctly so long as we work around
this weird exception.
"""
image_to_string(["no-text.png", "en"])

View File

@@ -1,17 +1,17 @@
from django.contrib.auth.mixins import LoginRequiredMixin
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.generic import DetailView, FormView, TemplateView
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import SearchFilter, OrderingFilter
from paperless.db import GnuPG
from paperless.mixins import SessionOrBasicAuthMixin
from paperless.views import StandardPagination
from rest_framework.filters import OrderingFilter, SearchFilter
from rest_framework.mixins import (
DestroyModelMixin,
ListModelMixin,
RetrieveModelMixin,
UpdateModelMixin
)
from rest_framework.pagination import PageNumberPagination
from rest_framework.permissions import IsAuthenticated
from rest_framework.viewsets import (
GenericViewSet,
@@ -41,7 +41,7 @@ class IndexView(TemplateView):
return TemplateView.get_context_data(self, **kwargs)
class FetchView(SessionOrBasicAuthMixin, DetailView):
class FetchView(LoginRequiredMixin, DetailView):
model = Document
@@ -74,7 +74,7 @@ class FetchView(SessionOrBasicAuthMixin, DetailView):
return response
class PushView(SessionOrBasicAuthMixin, FormView):
class PushView(LoginRequiredMixin, FormView):
"""
A crude REST-ish API for creating documents.
"""
@@ -92,6 +92,12 @@ class PushView(SessionOrBasicAuthMixin, FormView):
return HttpResponse("0")
class StandardPagination(PageNumberPagination):
page_size = 25
page_size_query_param = "page-size"
max_page_size = 100000
class CorrespondentViewSet(ModelViewSet):
model = Correspondent
queryset = Correspondent.objects.all()

View File

@@ -1 +1 @@
from .checks import paths_check, binaries_check
from .checks import paths_check

View File

@@ -1,15 +1,10 @@
import os
import shutil
from django.conf import settings
from django.core.checks import Error, register, Warning
@register()
def paths_check(app_configs, **kwargs):
"""
Check the various paths for existence, readability and writeability
"""
check_messages = []
@@ -49,38 +44,4 @@ def paths_check(app_configs, **kwargs):
writeable_hint.format(directory)
))
directory = os.getenv("PAPERLESS_STATICDIR")
if directory:
if not os.path.exists(directory):
check_messages.append(Error(
exists_message.format("PAPERLESS_STATICDIR"),
exists_hint.format(directory)
))
if not check_messages:
if not os.access(directory, os.W_OK | os.X_OK):
check_messages.append(Error(
writeable_message.format("PAPERLESS_STATICDIR"),
writeable_hint.format(directory)
))
return check_messages
@register()
def binaries_check(app_configs, **kwargs):
"""
Paperless requires the existence of a few binaries, so we do some checks
for those here.
"""
error = "Paperless can't find {}. Without it, consumption is impossible."
hint = "Either it's not in your ${PATH} or it's not installed."
binaries = (settings.CONVERT_BINARY, settings.UNPAPER_BINARY, "tesseract")
check_messages = []
for binary in binaries:
if shutil.which(binary) is None:
check_messages.append(Warning(error.format(binary), hint))
return check_messages

View File

@@ -1,46 +0,0 @@
from django.contrib.auth.mixins import AccessMixin
from django.contrib.auth import authenticate, login
import base64
class SessionOrBasicAuthMixin(AccessMixin):
"""
Session or Basic Authentication mixin for Django.
It determines if the requester is already logged in or if they have
provided proper http-authorization and returning the view if all goes
well, otherwise responding with a 401.
Base for mixin found here: https://djangosnippets.org/snippets/3073/
"""
def dispatch(self, request, *args, **kwargs):
# check if user is authenticated via the session
if request.user.is_authenticated:
# Already logged in, just return the view.
return super(SessionOrBasicAuthMixin, self).dispatch(
request, *args, **kwargs
)
# apparently not authenticated via session, maybe via HTTP Basic?
if 'HTTP_AUTHORIZATION' in request.META:
auth = request.META['HTTP_AUTHORIZATION'].split()
if len(auth) == 2:
# NOTE: Support for only basic authentication
if auth[0].lower() == "basic":
authString = base64.b64decode(auth[1]).decode('utf-8')
uname, passwd = authString.split(':')
user = authenticate(username=uname, password=passwd)
if user is not None:
if user.is_active:
login(request, user)
request.user = user
return super(
SessionOrBasicAuthMixin, self
).dispatch(
request, *args, **kwargs
)
# nope, really not authenticated
return self.handle_no_permission()

View File

@@ -14,12 +14,6 @@ import os
from dotenv import load_dotenv
# Tap paperless.conf if it's available
if os.path.exists("/etc/paperless.conf"):
load_dotenv("/etc/paperless.conf")
# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -27,14 +21,8 @@ BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.9/howto/deployment/checklist/
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
# public, you should change the key to something unique and verbose.
SECRET_KEY = os.getenv(
"PAPERLESS_SECRET_KEY",
"e11fl1oa-*ytql8p)(06fbj4ukrlo+n7k&q5+$1md7i+mge=ee"
)
# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = 'e11fl1oa-*ytql8p)(06fbj4ukrlo+n7k&q5+$1md7i+mge=ee'
# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True
@@ -44,37 +32,34 @@ LOGIN_URL = '/admin/login'
ALLOWED_HOSTS = ["*"]
_allowed_hosts = os.getenv("PAPERLESS_ALLOWED_HOSTS")
if _allowed_hosts:
if allowed_hosts:
ALLOWED_HOSTS = _allowed_hosts.split(",")
# Tap paperless.conf if it's available
if os.path.exists("/etc/paperless.conf"):
load_dotenv("/etc/paperless.conf")
# Application definition
INSTALLED_APPS = [
"django.contrib.auth",
"django.contrib.contenttypes",
"django.contrib.sessions",
"django.contrib.messages",
"django.contrib.staticfiles",
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
"django_extensions",
"documents.apps.DocumentsConfig",
"reminders.apps.RemindersConfig",
"paperless_tesseract.apps.PaperlessTesseractConfig",
"flat_responsive",
"django.contrib.admin",
"rest_framework",
"crispy_forms",
]
if os.getenv("PAPERLESS_INSTALLED_APPS"):
INSTALLED_APPS += os.getenv("PAPERLESS_INSTALLED_APPS").split(",")
MIDDLEWARE_CLASSES = [
'django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
@@ -156,7 +141,7 @@ AUTH_PASSWORD_VALIDATORS = [
LANGUAGE_CODE = 'en-us'
TIME_ZONE = os.getenv("PAPERLESS_TIME_ZONE", "UTC")
TIME_ZONE = 'UTC'
USE_I18N = True
@@ -168,8 +153,7 @@ USE_TZ = True
# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.9/howto/static-files/
STATIC_ROOT = os.getenv(
"PAPERLESS_STATICDIR", os.path.join(BASE_DIR, "..", "static"))
STATIC_ROOT = os.path.join(BASE_DIR, "..", "static")
MEDIA_ROOT = os.getenv(
"PAPERLESS_MEDIADIR", os.path.join(BASE_DIR, "..", "media"))
@@ -203,7 +187,7 @@ LOGGING = {
# The default language that tesseract will attempt to use when parsing
# documents. It should be a 3-letter language code consistent with ISO 639.
OCR_LANGUAGE = os.getenv("PAPERLESS_OCR_LANGUAGE", "eng")
OCR_LANGUAGE = "eng"
# The amount of threads to use for OCR
OCR_THREADS = os.getenv("PAPERLESS_OCR_THREADS")
@@ -265,8 +249,3 @@ SHARED_SECRET = os.getenv("PAPERLESS_SHARED_SECRET", "")
# Trigger a script after every successful document consumption?
PRE_CONSUME_SCRIPT = os.getenv("PAPERLESS_PRE_CONSUME_SCRIPT")
POST_CONSUME_SCRIPT = os.getenv("PAPERLESS_POST_CONSUME_SCRIPT")
# The number of items on each page in the web UI. This value must be a
# positive integer, but if you don't define one in paperless.conf, a default of
# 100 will be used.
PAPERLESS_LIST_PER_PAGE = int(os.getenv("PAPERLESS_LIST_PER_PAGE", 100))

View File

@@ -24,14 +24,12 @@ from documents.views import (
IndexView, FetchView, PushView,
CorrespondentViewSet, TagViewSet, DocumentViewSet, LogViewSet
)
from reminders.views import ReminderViewSet
router = DefaultRouter()
router.register(r"correspondents", CorrespondentViewSet)
router.register(r"documents", DocumentViewSet)
router.register(r"logs", LogViewSet)
router.register(r"reminders", ReminderViewSet)
router.register(r"tags", TagViewSet)
router.register(r'correspondents', CorrespondentViewSet)
router.register(r'tags', TagViewSet)
router.register(r'documents', DocumentViewSet)
router.register(r'logs', LogViewSet)
urlpatterns = [

View File

@@ -1 +1 @@
__version__ = (0, 3, 6)
__version__ = (0, 3, 2)

View File

@@ -1,7 +0,0 @@
from rest_framework.pagination import PageNumberPagination
class StandardPagination(PageNumberPagination):
page_size = 25
page_size_query_param = "page-size"
max_page_size = 100000

View File

@@ -1,16 +0,0 @@
from django.apps import AppConfig
class PaperlessTesseractConfig(AppConfig):
name = "paperless_tesseract"
def ready(self):
from documents.signals import document_consumer_declaration
from .signals import ConsumerDeclaration
document_consumer_declaration.connect(ConsumerDeclaration.handle)
AppConfig.ready(self)

View File

@@ -1,214 +0,0 @@
import itertools
import os
import re
import subprocess
from multiprocessing.pool import Pool
import langdetect
import pyocr
from django.conf import settings
from documents.parsers import DocumentParser, ParseError
from PIL import Image
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from pyocr.tesseract import TesseractError
from .languages import ISO639
class OCRError(Exception):
pass
class RasterisedDocumentParser(DocumentParser):
"""
This parser uses Tesseract to try and get some text out of a rasterised
image, whether it's a PDF, or other graphical format (JPEG, TIFF, etc.)
"""
CONVERT = settings.CONVERT_BINARY
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
UNPAPER = settings.UNPAPER_BINARY
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
def get_thumbnail(self):
"""
The thumbnail of a PDF is just a 500px wide image of the first page.
"""
run_convert(
self.CONVERT,
"-scale", "500x5000",
"-alpha", "remove",
self.document_path, os.path.join(self.tempdir, "convert-%04d.png")
)
return os.path.join(self.tempdir, "convert-0000.png")
def get_text(self):
images = self._get_greyscale()
try:
return self._get_ocr(images)
except OCRError as e:
raise ParseError(e)
def _get_greyscale(self):
"""
Greyscale images are easier for Tesseract to OCR
"""
# Convert PDF to multiple PNMs
pnm = os.path.join(self.tempdir, "convert-%04d.pnm")
run_convert(
self.CONVERT,
"-density", str(self.DENSITY),
"-depth", "8",
"-type", "grayscale",
self.document_path, pnm,
)
# Get a list of converted images
pnms = []
for f in os.listdir(self.tempdir):
if f.endswith(".pnm"):
pnms.append(os.path.join(self.tempdir, f))
# Run unpaper in parallel on converted images
with Pool(processes=self.THREADS) as pool:
pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
# Return list of converted images, processed with unpaper
pnms = []
for f in os.listdir(self.tempdir):
if f.endswith(".unpaper.pnm"):
pnms.append(os.path.join(self.tempdir, f))
return sorted(filter(lambda __: os.path.isfile(__), pnms))
def _guess_language(self, text):
try:
guess = langdetect.detect(text)
self.log("debug", "Language detected: {}".format(guess))
return guess
except Exception as e:
self.log("warning", "Language detection error: {}".format(e))
def _get_ocr(self, imgs):
"""
Attempts to do the best job possible OCR'ing the document based on
simple language detection trial & error.
"""
if not imgs:
raise OCRError("No images found")
self.log("info", "OCRing the document")
# Since the division gets rounded down by int, this calculation works
# for every edge-case, i.e. 1
middle = int(len(imgs) / 2)
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
guessed_language = self._guess_language(raw_text)
if not guessed_language or guessed_language not in ISO639:
self.log("warning", "Language detection failed!")
if settings.FORGIVING_OCR:
self.log(
"warning",
"As FORGIVING_OCR is enabled, we're going to make the "
"best with what we have."
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError("Language detection failed")
if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
try:
return self._ocr(imgs, ISO639[guessed_language])
except pyocr.pyocr.tesseract.TesseractError:
if settings.FORGIVING_OCR:
self.log(
"warning",
"OCR for {} failed, but we're going to stick with what "
"we've got since FORGIVING_OCR is enabled.".format(
guessed_language
)
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError(
"The guessed language is not available in this instance of "
"Tesseract."
)
def _ocr(self, imgs, lang):
"""
Performs a single OCR attempt.
"""
if not imgs:
return ""
self.log("info", "Parsing for {}".format(lang))
with Pool(processes=self.THREADS) as pool:
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
r = " ".join(r)
# Strip out excess white space to allow matching to go smoother
return strip_excess_whitespace(r)
def _assemble_ocr_sections(self, imgs, middle, text):
"""
Given a `middle` value and the text that middle page represents, we OCR
the remainder of the document and return the whole thing.
"""
text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
return text
def run_convert(*args):
environment = os.environ.copy()
if settings.CONVERT_MEMORY_LIMIT:
environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
if settings.CONVERT_TMPDIR:
environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
subprocess.Popen(args, env=environment).wait()
def run_unpaper(args):
unpaper, pnm = args
subprocess.Popen(
(unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
def strip_excess_whitespace(text):
collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
no_leading_whitespace = re.sub(
"([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
return no_trailing_whitespace
def image_to_string(args):
img, lang = args
ocr = pyocr.get_available_tools()[0]
with Image.open(os.path.join(RasterisedDocumentParser.SCRATCH, img)) as f:
if ocr.can_detect_orientation():
try:
orientation = ocr.detect_orientation(f, lang=lang)
f = f.rotate(orientation["angle"], expand=1)
except (TesseractError, OtherTesseractError):
pass
return ocr.image_to_string(f, lang=lang)

View File

@@ -1,23 +0,0 @@
import re
from .parsers import RasterisedDocumentParser
class ConsumerDeclaration(object):
MATCHING_FILES = re.compile("^.*\.(pdf|jpg|gif|png|tiff?|pnm|bmp)$")
@classmethod
def handle(cls, sender, **kwargs):
return cls.test
@classmethod
def test(cls, doc):
if cls.MATCHING_FILES.match(doc.lower()):
return {
"parser": RasterisedDocumentParser,
"weight": 0
}
return None

View File

@@ -1,80 +0,0 @@
import os
from unittest import mock, skipIf
import pyocr
from django.test import TestCase
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from ..parsers import image_to_string, strip_excess_whitespace
class FakeTesseract(object):
@staticmethod
def can_detect_orientation():
return True
@staticmethod
def detect_orientation(file_handle, lang):
raise OtherTesseractError("arbitrary status", "message")
@staticmethod
def image_to_string(file_handle, lang):
return "This is test text"
class FakePyOcr(object):
@staticmethod
def get_available_tools():
return [FakeTesseract]
class TestOCR(TestCase):
text_cases = [
("simple string", "simple string"),
(
"simple newline\n testing string",
"simple newline\ntesting string"
),
(
"utf-8 строка с пробелами в конце ",
"utf-8 строка с пробелами в конце"
)
]
SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
def test_strip_excess_whitespace(self):
for source, result in self.text_cases:
actual_result = strip_excess_whitespace(source)
self.assertEqual(
result,
actual_result,
"strip_exceess_whitespace({}) != '{}', but '{}'".format(
source,
result,
actual_result
)
)
@skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
@mock.patch(
"paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
SAMPLE_FILES
)
@mock.patch("paperless_tesseract.parsers.pyocr", FakePyOcr)
def test_image_to_string_with_text_free_page(self):
"""
This test is sort of silly, since it's really just reproducing an odd
exception thrown by pyocr when it encounters a page with no text.
Actually running this test against an installation of Tesseract results
in a segmentation fault rooted somewhere deep inside pyocr where I
don't care to dig. Regardless, if you run the consumer normally,
text-free pages are now handled correctly so long as we work around
this weird exception.
"""
image_to_string(["no-text.png", "en"])

View File

@@ -1,36 +0,0 @@
from django.test import TestCase
from ..signals import ConsumerDeclaration
class SignalsTestCase(TestCase):
def test_test_handles_various_file_names_true(self):
prefixes = (
"doc", "My Document", "Μυ Γρεεκ Δοψθμεντ", "Doc -with - tags",
"A document with a . in it", "Doc with -- in it"
)
suffixes = (
"pdf", "jpg", "gif", "png", "tiff", "tif", "pnm", "bmp",
"PDF", "JPG", "GIF", "PNG", "TIFF", "TIF", "PNM", "BMP",
"pDf", "jPg", "gIf", "pNg", "tIff", "tIf", "pNm", "bMp",
)
for prefix in prefixes:
for suffix in suffixes:
name = "{}.{}".format(prefix, suffix)
self.assertTrue(ConsumerDeclaration.test(name))
def test_test_handles_various_file_names_false(self):
prefixes = ("doc",)
suffixes = ("txt", "markdown", "",)
for prefix in prefixes:
for suffix in suffixes:
name = "{}.{}".format(prefix, suffix)
self.assertFalse(ConsumerDeclaration.test(name))
self.assertFalse(ConsumerDeclaration.test(""))
self.assertFalse(ConsumerDeclaration.test("doc"))

View File

@@ -1,20 +0,0 @@
from django.conf import settings
from django.contrib import admin
from .models import Reminder
class ReminderAdmin(admin.ModelAdmin):
class Media:
css = {
"all": ("paperless.css",)
}
list_per_page = settings.PAPERLESS_LIST_PER_PAGE
list_display = ("date", "document", "note")
list_filter = ("date",)
list_editable = ("note",)
admin.site.register(Reminder, ReminderAdmin)

View File

@@ -1,5 +0,0 @@
from django.apps import AppConfig
class RemindersConfig(AppConfig):
name = "reminders"

View File

@@ -1,14 +0,0 @@
from django_filters.rest_framework import CharFilter, FilterSet
from .models import Reminder
class ReminderFilterSet(FilterSet):
class Meta(object):
model = Reminder
fields = {
"document": ["exact"],
"date": ["gt", "lt", "gte", "lte", "exact"],
"note": ["istartswith", "iendswith", "icontains"]
}

View File

@@ -1,27 +0,0 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-03-25 15:58
from __future__ import unicode_literals
from django.db import migrations, models
import django.db.models.deletion
class Migration(migrations.Migration):
initial = True
dependencies = [
('documents', '0016_auto_20170325_1558'),
]
operations = [
migrations.CreateModel(
name='Reminder',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('date', models.DateTimeField()),
('note', models.TextField(blank=True)),
('document', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='documents.Document')),
],
),
]

View File

@@ -1,8 +0,0 @@
from django.db import models
class Reminder(models.Model):
document = models.ForeignKey("documents.Document")
date = models.DateTimeField()
note = models.TextField(blank=True)

View File

@@ -1,14 +0,0 @@
from documents.models import Document
from rest_framework import serializers
from .models import Reminder
class ReminderSerializer(serializers.HyperlinkedModelSerializer):
document = serializers.HyperlinkedRelatedField(
view_name="drf:document-detail", queryset=Document.objects)
class Meta(object):
model = Reminder
fields = ("id", "document", "date", "note")

View File

@@ -1,3 +0,0 @@
from django.test import TestCase
# Create your tests here.

View File

@@ -1,22 +0,0 @@
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import OrderingFilter
from rest_framework.permissions import IsAuthenticated
from rest_framework.viewsets import (
ModelViewSet,
)
from .filters import ReminderFilterSet
from .models import Reminder
from .serialisers import ReminderSerializer
from paperless.views import StandardPagination
class ReminderViewSet(ModelViewSet):
model = Reminder
queryset = Reminder.objects
serializer_class = ReminderSerializer
pagination_class = StandardPagination
permission_classes = (IsAuthenticated,)
filter_backends = (DjangoFilterBackend, OrderingFilter)
filter_class = ReminderFilterSet
ordering_fields = ("date", "document")

View File

@@ -5,7 +5,7 @@
[tox]
skipsdist = True
envlist = py34, py35, py36, pep8
envlist = py34, py35, pep8
[testenv]
commands = {envpython} manage.py test