Django

Code

Ticket #1919 (reopened)

Opened 3 years ago

Last modified 1 year ago

filter truncatewords wipes newlines from string, so not chainable with markup filters

Reported by: derelm Assigned to: nobody
Milestone: Component: Template system
Version: SVN Keywords: sprintsept14
Cc: Triage Stage: Accepted
Has patch: 1 Needs documentation: 0
Needs tests: 0 Patch needs improvement: 0

Description

i tried to chain truncatewords with markdown, but this doesn't work as truncatewords removes newlines from the input string. newlines are crucial for markdown and possibly other markup-parsers.

Attachments

truncate_words_preserving_whitespace.diff (1.2 kB) - added by Tom Tobin <korpios@gmail.com> on 05/26/06 16:53:19.
patch against r2991 (trunk); cause truncatewords filter to preserve whitespace
truncate_words_preserving_whites.diff (0.9 kB) - added by didier Belot <dib@electrolinux.net> on 08/01/07 09:01:52.
patch against trunk rev. 5783, simplified
1919-1.diff (2.5 kB) - added by mattmcc on 09/14/07 23:42:33.
Add a regression test
truncate_tests.diff (2.0 kB) - added by deryck on 09/21/07 13:41:44.
Tests for django.utils.text.trunacte_*.
truncate.py (2.5 kB) - added by SmileyChris on 09/21/07 15:17:10.
truncate_words.diff (1.8 kB) - added by arien <regexbot@gmail.com> on 11/13/07 16:12:25.
truncate_words that does what it says on the package; includes tests
truncate_words_faster.diff (2.0 kB) - added by arien <regexbot@gmail.com> on 11/14/07 01:57:47.
faster version of the same idea; includes additional tests for trailing whitespace handling

Change History

05/26/06 16:53:19 changed by Tom Tobin <korpios@gmail.com>

  • attachment truncate_words_preserving_whitespace.diff added.

patch against r2991 (trunk); cause truncatewords filter to preserve whitespace

05/26/06 16:55:42 changed by Tom Tobin <korpios@gmail.com>

I've attached a patch that will cause the truncatewords filter to preserve all whitespace (including leading and trailing whitespace).

01/17/07 16:12:17 changed by

  • milestone deleted.

Milestone Version 1.0 deleted

01/31/07 22:12:24 changed by SmileyChris

  • stage changed from Unreviewed to Ready for checkin.

01/31/07 22:12:39 changed by SmileyChris

  • has_patch set to 1.

02/09/07 20:04:58 changed by mtredinnick

  • owner changed from adrian to mtredinnick.
  • needs_better_patch set to 1.
  • stage changed from Ready for checkin to Accepted.

I'm going to bounce this patch back for improvement (I'm just doing a "check in the obviously correct" pass at the moment, so I don't want to stop and fix it right now).

I think you can get away with just one call to re.split() without the subsequent re.search() every time through the list. After the call to re.split(), every second element in the returned list is the whitespace. So you can iterate through the "words" list (in the patch's terminology) two elements at a time: the first one will be a word, the second the whitespace. That avoids the need to call whitespace_re.search() each time. It will improve performance a bit (and reduce the line count, too, I suspect).

A test of this would be nice, too, since it's not immediately clear that it works 100%.

02/09/07 20:45:43 changed by SmileyChris

  • needs_tests set to 1.

08/01/07 09:01:52 changed by didier Belot <dib@electrolinux.net>

  • attachment truncate_words_preserving_whites.diff added.

patch against trunk rev. 5783, simplified

08/01/07 09:15:10 changed by didier Belot <dib@electrolinux.net>

Yes, the word_re in my patch can be as simple as the first below

# split at words boundaries
word_re = re.compile(r'(\w+)',re.UNICODE) 

# split at **markdown** or _such marked_  up words
word_re = re.compile(r'((?:\*|_)*\w+(?:\*|_)*)',re.UNICODE) 

Anyway, it was just for the fun, as I don't think this is the way to go: chaining truncatewords to markdown will break anyway! I Think it's better to use markdon first, then chain to htmltruncatewords.

What sort of test are needed for such a patch ?

09/14/07 23:42:33 changed by mattmcc

  • attachment 1919-1.diff added.

Add a regression test

09/14/07 23:44:07 changed by mattmcc

  • needs_better_patch deleted.
  • needs_tests deleted.

09/14/07 23:48:59 changed by mattmcc

  • keywords set to sprintsept14.

09/16/07 07:37:20 changed by ubernostrum

  • status changed from new to closed.
  • resolution set to fixed.

An HTML-safe truncation filter was added some time ago, so the solution here is to do the markup filtering before truncating.

09/16/07 16:03:16 changed by SmileyChris

  • status changed from closed to reopened.
  • resolution deleted.

I disagree that it's been fixed. truncate_words should preserve existing white space. We have a patch here and one in #4655. Either one of those seems to do the job.

09/21/07 13:40:56 changed by deryck

I agree with James that this should be closed. Just use truncate_html_words to preserve space. Changing truncate_words will affect the performance of that function.

Not closing it again, though, because I'm attaching tests to show truncate_html_words will do what's needed and just so django.utils.text.truncate_* gets a little test coverage.

09/21/07 13:41:44 changed by deryck

  • attachment truncate_tests.diff added.

Tests for django.utils.text.trunacte_*.

09/21/07 15:16:30 changed by anonymous

Rubbish about affecting performance. My version is faster than the current one anyway! Run truncate.py from a manage.py shell if you want a test comparison

09/21/07 15:17:10 changed by SmileyChris

  • attachment truncate.py added.

09/21/07 15:17:46 changed by SmileyChris

Stupid basic HTTP auth. Twas my anonymous comment.

11/13/07 16:12:25 changed by arien <regexbot@gmail.com>

  • attachment truncate_words.diff added.

truncate_words that does what it says on the package; includes tests

11/14/07 01:57:47 changed by arien <regexbot@gmail.com>

  • attachment truncate_words_faster.diff added.

faster version of the same idea; includes additional tests for trailing whitespace handling

11/14/07 02:49:41 changed by arien <regexbot@gmail.com>

The last patch contains a version of trunacte_words that preserves all whitespace before the nth "word" (sequence of non-blanks); see the tests for exact behaviour.

(This version of truncate_words differs from the previous version in that it doesn't use re.UNICODE for the regex and that it updates the counter by hand instead of using enumerate. This patch also contains a couple more tests for trailing whitespace handling.)

This version is somewhat slower than the current version when the string contains few words or is to be truncated after a large fraction of the words in the string, but is faster when truncating after a small fraction of the words in a strings containing many words. YMMV.

I tried to a version using re.split as well (splitting on and capturing non-whitespace, with max_split=length) but it consistently performed worse than the attached version IIRC.

One comment about testing the performance of the current version and alternatives to each other: be sure to compare like with like, i.e., make sure your benchmarking code looks like this:

from django.utils.text import truncate_words
from django.utils.encoding import force_unicode
from django.utils.functional import allow_lazy

def my_truncate_words(s, num):
    length = int(num)
    s = force_unicode(s)
    # code to return a Unicode string
my_truncate_words = allow_lazy(my_truncate_words, unicode)

# other versions follow same pattern

11/14/07 15:06:04 changed by SmileyChris

Looks good, arien. Only question is that I'm not sure this is the correct behaviour:

>>> truncatewords(u'  Text with leading and trailing whitespace  ', 6)
u'  Text with leading and trailing whitespace ...'

It seems like the filter should be counting the words, not the spaces between them. So if there's 6 words, it should just output the original string. Thoughts?

11/16/07 11:31:00 changed by arien <regexbot@gmail.com>

I think you can argue either way. The approach I took was to truncate immediately after the nth word if the the input string had any characters following it. The reason: "Truncates a string after a certain number of words." ;-)

What seem to be proposing is to return a string that has no more than n words in it. (I think that'll mean counting spaces and not words. We'll see...) Does this look reasonable?

>>> truncatewords(u'  Double-spaced  text.  ', 2) # and above
u'  Double-spaced  text.  '
>>> truncatewords(u'  Double-spaced  text.  ', 1)
u'  Double-spaced  ...'
>>> truncatewords(u'  Double-spaced  text.  ', 0) # and below
u'  ...'

11/18/07 01:20:57 changed by SmileyChris

I agree, it could be taken either way. Don't take my word as the way it should be ;)

How I see that it should work: If the string has less than or equal to the number of words, then the original string should be returned. Otherwise the string should be truncated to the Xth word and appended with ' ...' (one space then an elipsis).


Add/Change #1919 (filter truncatewords wipes newlines from string, so not chainable with markup filters)




Change Properties
Action