58 Commits

Author SHA1 Message Date
Trenton H
a2b7687c3b In the case of an RTL language being extracted via pdfminer.six, fall back to forced OCR, which handles RTL text better 2022-12-29 16:02:02 -08:00
Trenton Holmes
55ef0d4a1b Fixes language code checks around two-part languages 2022-12-04 12:23:12 -08:00
Trenton H
e96d65f945 Allows parsing of WebP format images 2022-11-28 09:35:54 -08:00
Trenton H
f015556562 Adds a test to cover this edge case 2022-11-22 07:22:41 -08:00
Trenton Holmes
d1aa08850d Reverts the change around skip_noarchive to align with how it is documented to work 2022-10-20 13:34:41 -07:00
Trenton Holmes
b3b2519bf0 Fixes the creation of an archive file, even if noarchive was specified 2022-08-20 13:47:56 -07:00
Trenton Holmes
49a843dcdd Changes the simple-alpha parsing test to use a tempdir so the original isn't modified in Git 2022-07-02 16:19:22 +02:00
Trenton Holmes
1771d18a21 Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
fc695896dd Format Python code with black 2022-02-27 15:26:41 +01:00
Martin Müller
73a8569d21 Modify test for PNG image with alpha 2022-02-21 22:38:25 +01:00
jonaswinkler
0e596bd1fc also apply \0 removal to sidecar contents 2021-03-22 23:08:34 +01:00
jonaswinkler
40ce38254b fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
6ab884a95c update dependencies 2021-02-28 13:01:26 +01:00
jonaswinkler
99a18516b2 tests 2021-02-22 00:17:16 +01:00
jonaswinkler
50c1978d36 tests 2021-02-21 00:18:34 +01:00
jonaswinkler
9cbb1c5726 add some test files 2021-02-21 00:13:08 +01:00
jonaswinkler
56bd966c02 local import of ocrmypdf so that the webserver does not load that 2021-02-15 12:18:10 +01:00
jonaswinkler
89d6e422f5 fix bugs and test cases 2021-01-02 15:37:27 +01:00
jonaswinkler
1b1b57eb6a more tests 2020-12-19 15:54:13 +01:00
jonaswinkler
a0631413d6 fixes bauerj/paperless_app#23 and most of all other scanner apps out there. 2020-12-12 18:25:15 +01:00
jonaswinkler
e3ce573fbb a couple fixes and more supported image files 2020-12-02 17:39:49 +01:00
jonaswinkler
12fa844c7f testing the new noarchive option. 2020-12-01 14:30:13 +01:00
jonaswinkler
ac1b701000 more tests! 2020-11-29 19:58:48 +01:00
jonaswinkler
06cfc3113a test case fixes. 2020-11-27 14:06:37 +01:00
Jonas Winkler
e87575240d more tests of the new parser 2020-11-26 00:08:23 +01:00
Jonas Winkler
f51d2be303 fixed the test cases 2020-11-25 19:51:09 +01:00
Jonas Winkler
56ce267f89 removed obsolete tests. 2020-11-25 14:51:32 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
1655d85a53 testing the tesseract parser 2020-11-19 20:31:08 +01:00
Jonas Winkler
d2e22e3f27 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
f182709fdd fixed most of the tests 2020-11-02 19:42:23 +01:00
Jonas Winkler
7d282a4e4e removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Johannes Wienke
a311cd498c Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle
ValueErrors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
a3aab0cb48 Remove duplicated date parsing test
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Daniel Quinn
637b0d4cc2 Drop problematic tests
Some tests had differing outcomes depending on the version of Tesseract
installed on the test system.  This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.

This commit removes those tests, and while it reduces our coverage, at
least the results are predictable.
2018-12-30 17:32:45 +00:00
Daniel Quinn
27af2603f5 Use modern languages for sample test files 2018-12-30 14:09:17 +00:00
Erik Arvstedt
a19f0ef97e Fix date test sample image
The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail.
2018-12-02 15:10:21 +01:00
Daniel Quinn
d544f269e0 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Daniel Quinn
650db75c2b Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing 2018-12-01 16:57:16 +00:00
Daniel Quinn
c1d18c1e83 Fix language guesses in tests
It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.
2018-12-01 15:55:59 +00:00
Joshua Taillon
730daa3d6d Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing 2018-11-15 23:17:59 -05:00
Joshua Taillon
e1d8744c66 Add option for parsing of date from filename (and associated tests) 2018-11-15 20:32:15 -05:00
Joshua Taillon
4409f65840 Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC 2018-11-15 20:30:23 -05:00
Daniel Quinn
2a3f766b93 Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
8010d72f18 Tweak the date guesser to not allow dates prior to 1900 (#414) 2018-10-01 20:03:47 +01:00
Erik Arvstedt
be2cbebaf7 Stop tests from writing to the source tree 2018-07-19 23:48:23 +02:00
Wolf-Bastian Pöttner
fba58f3bdd Increase test coverage by testing two more date detection cases 2018-02-19 21:36:48 +01:00
Daniel Quinn
6662ca3467 Fix formatting 2018-02-18 18:00:34 +00:00
Daniel Quinn
6f1ed89e26 Fix tests to use _text instead of TEXT_CACHE 2018-02-18 18:00:22 +00:00