Worum geht es?
Weiter zum BeispielIch wollte zwei Texte vergleichen und ihre Unterschiede hervorheben. Die günstigste Lösung die ich gefunden habe war der Algorithmus des Longest Common Subsequence (LCS) Problems.
Die Vorteile
LCS kann beliebige Sequenzen vergleichen und findet längste übereinstimmende Abschnitte. Damit können sehr schön gelöschte oder hinzugefügte Teile gefunden werden. Diese Sequenzen müssen keine Texte sein. So werden mit diesem Algorithmus auch DNS-Abschnitte verglichen.
Nachteile von LCS
Aufwand und Speicherbedarf wachsen exponentiell mit der Länge der zu vergleichenden Sequenzen.
Solch starkes Wachstum an Speicher- und Zeitaufwand ist immer ein Problem. Da beides für Webserver knappe Resourcen sind, die auf viele Anfragen zu verteilen sind, sollte diese Funktion nicht unüberlegt eingesetzt werden.
Optimierungsmöglichkeiten
Oft bietet es sich an erst nach Übereinstimmungen an Anfang und Ende zu suchen und diese schon vorab als gleich zu werten. Bei Quellcode ist dieses Vorgehen meist sinnvoll. Damit ist aber nicht mehr sichergestellt, dass die längsten gleichen Teile gefunden werden. Außerdem traf dies einfach nicht auf meinen Anwendungsfall zu.
Vor allem Quellcode wird auch häufig nur zeilenweise verglichen. Dies reduziert die Anzahl Elemente und damit Zeit- und Speicherbedarf.
Beispiele
Weiter zum QuellcodeBei den Beispielen wird jeweils der Prozentsatz an Übereinstimmung sowie die ursprünglich benötigte Zeit ausgegeben. Der Prozentsatz ist der Quotient aus der Anzahl der Übereinstimmungen durch das Maximum der Längen der beiden Sequenzen.
Ein Vergleich zweier Texte nach Zeichen: 76,000 % (19.944ms)
| H«e»llo W«or»l«d»! LCS «w»it«h» PHP | H«a»llo W«e»l«t»! LCS «m»it PHP |
Die selben Texte nun Wortweise verglichen: 70,00 % (0,005ms)
| «Hello» «World»! LCS «with» PHP | «Hallo» «Welt»! LCS «mit» PHP |
Ein HTML-Text nun Wortweise verglichen: 76,28 % (3,666ms)
«Erlkönig»Autor: «Johann Wolfgang von Goethe» |
«Dieb»Autor«und Plagiator»: «Heiko Richler» |
Quellcode
Weiter zum Download 1
<?php
2 /*********************************************************************
3 *
4 * class for lcs (longest common subsequence)
5 *
6 * (C) 2005-2006 by Heiko Richler http://www.richler.de/
7 *
8 * Search for longest common subsequence problem (LCS) for details
9 * aber the algorithmus.
10 *
11 * This Code is provided unter the GNU Lesser General Public License
12 * http://www.gnu.org/licenses/lgpl.html
13 *
14 * No warranty at all!
15 *
16 * 2006-12-01 Heiko Richler
17 * Cleaned code and comments
18 */
19
20 class lcs {
21 /****************************************************************
22 *
23 * wordCompare(&$fir, &$sec, &$result)
24 *
25 * compares strings word by word not by character.
26 *
27 * $fir, $sec : Strings to compare
28 * $result : must be an object of the class report4lcs
29 * and will keep the result.
30 *
31 * returns a proportion or null in case of an error.
32 */
33 function wordCompare(&$fir, &$sec, &$result) {
34 // Regular Expressen to split strings into words. In this
35 // context a word may be a HTML-Tag, some white-spaces,
36 // letters and figures or the other
37 // characters. Letters inkludes german umlauts.
38 $pattern = "(<[^>]*>*". // HTML Tag
39 "|[ \t\n]+". // White Spaces
40 "|[a-zA-Z0-9äöüÄÖÜß]+". // Letters
41 "|[^a-zA-Z0-9äöüÄÖÜß \t\n<]+".// anything else
42 ")";
43
44 // doing the splitting and storing the product of the words in $power
45 $i = preg_match_all($pattern, $fir, $src_arr, PREG_PATTERN_ORDER);
46 $j = preg_match_all($pattern, $sec, $dst_arr, PREG_PATTERN_ORDER);
47 $power = $i * $j;
48 // if $power is to large the skript may be terminated
49 if ($power<200000) {
50 return lcs::Compare($src_arr[0], $dst_arr[0],
51 equal, count, SubArray, $result);
52 }
53 if ($fir==$sec) {
54 $result->put_eq($fir);
55 return 1;
56 }
57 $result->put_diff($fir, $sec);
58 return null;
59 }
60
61 /****************************************************************
62 *
63 * HTMLwordCompare(&$fir, &$sec, &$result)
64 *
65 * compares strings word by word not by character and respects
66 * HTML-Encoding.
67 *
68 * $fir, $sec : Strings to compare
69 * $result : must be an object of the class report4lcs
70 * and will keep the result.
71 *
72 * returns a proportion or null in case of an error.
73 */
74 function HTMLwordCompare($fir, $sec, &$result) {
75 // Regular Expressen to split strings into words. In this
76 // context a word may be a HTML-Tag, some white-spaces,
77 // letters and figures or the other
78 // characters. Letters inkludes html-encoding and german umlauts.
79 $pattern = "(<[^>]*>[ \t\n]*". // HTML Tag
80 "|[ \t\n]+". // White Spaces
81 "|([a-zA-Z0-9äöüÄÖÜß]|&[a-zA-Z#0-9]+;)+[ \t\n]*".// Letters
82 "|[^a-zA-Z0-9äöüÄÖÜß \t\n<]+[ \t\n]*".// else
83 ")";
84 // doing the splitting and storing the product of the words in $power
85 $i = preg_match_all($pattern, $fir, $src_arr, PREG_PATTERN_ORDER);
86 $j = preg_match_all($pattern, $sec, $dst_arr, PREG_PATTERN_ORDER);
87 $power = $i * $j;
88 // if $power is to large the skript may be terminated
89 if ($power<200000) {
90 return lcs::Compare($src_arr[0], $dst_arr[0],
91 trimmed_equal, count, SubArray, $result);
92 }
93 if ($fir==$sec) {
94 $result->put_eq($fir);
95 return 1;
96 }
97 $result->put_diff($fir, $sec);
98 return null;
99 }
100
101
102 /****************************************************************
103 *
104 * charCompare(&$fir, &$sec, &$result)
105 *
106 * compares strings by characters
107 *
108 * $fir, $sec : Strings to compare
109 * $result : must be an object of the class report4lcs
110 * and will keep the result.
111 *
112 * returns a proportion or null in case of an error.
113 */
114 function charCompare(&$fir, &$sec, &$result) {
115 return lcs::Compare($fir, $sec, equal, strlen, substr, $result);
116 }
117
118 /****************************************************************
119 *
120 * Compare($fir, $sec, $equal, $sizeop, $partop, &$result)
121 *
122 * compares strings by characters
123 *
124 * $fir, $sec : Strings to compare or Array of Symbols
125 * $equal : bool-function for equality
126 * $sizeop : a function to calculate a strings len
127 * $partop : a function to get a substring
128 * $result : must be an object of the class report4lcs
129 * and will get the result.
130 *
131 * returns a proportion
132 */
133 function Compare($fir, $sec, $equal, $sizeop, $partop, &$result) {
134 $lenFir = $sizeop($fir);
135 $lenSec = $sizeop($sec);
136 $table = array(array()); // of int [$lenFir, $lenSec]
137
138 //lcs:
139 for ($i=0; $i<$lenFir; $i++) {
140 for ($j=0; $j<$lenSec; $j++) {
141 $table[$i][$j] = -1;
142 }
143 }
144 for ($i = $lenFir; $i >= 0; $i--) {
145 for ($j = $lenSec; $j >= 0; $j--) {
146 if ($equal($fir[$i], $sec[$j])) {
147 $table[$i][$j] = 1 + $table[$i + 1][$j + 1];
148 }
149 else {
150 $table[$i][$j] = max($table[$i + 1][$j],
151 $table[$i][$j + 1]);
152 }
153 }
154 }
155 $equality = $table[0][0]-1; // max($lenFir, $lenSec);
156
157 // lets get results
158
159 $leftPos = 0;
160 $rightPos = 0;
161 while ($leftPos < $lenFir || $rightPos < $lenSec) {
162 if ($x < $lenFir && $y < $lenSec && $equal($fir[$leftPos],$sec[$rightPos])) {
163 // common part
164 $len = 0;
165 $x=$leftPos;
166 $y=$rightPos;
167 while ($x < $lenFir && $y < $lenSec && $equal($fir[$x], $sec[$y])) {
168 $x++;
169 $y++;
170 $len++;
171 }
172 $result->put_eq($partop($fir, $leftPos, $len));
173 $leftPos=$x;
174 $rightPos=$y;
175 }
176 else {
177 // unequal
178 $leftLen = 0;
179 $rightLen = 0;
180 $x=$leftPos;
181 $y=$rightPos;
182 while ($x < $lenFir && $y < $lenSec && !$equal($fir[$x], $sec[$y])) {
183 if ($table[$x+1][$y] > $table[$x][$y+1]) {
184 // included into left side
185 $x++;
186 $leftLen++;
187 }
188 elseif ($table[$x+1][$y] < $table[$x][$y+1]) {
189 // included into right side
190 $y++;
191 $rightLen++;
192 }
193 else {
194 //different
195 $x++;
196 $leftLen++;
197 $y++;
198 $rightLen++;
199 }
200 }
201 if ($x >= $lenFir || $y >= $lenSec) {
202 // done
203 $result->put_diff($partop($fir, $leftPos), $partop($sec, $rightPos));
204 $leftPos=$lenFir;
205 $rightPos=$lenSec;
206 }
207 else {
208 // there is more to come
209 $result->put_diff($partop($fir, $leftPos, $leftLen), $partop($sec, $rightPos, $rightLen));
210 $leftPos=$x;
211 $rightPos=$y;
212 }
213 }
214 }
215
216 if (max($lenFir, $lenSec)>0) {
217 return (($equality * 100) / max($lenFir, $lenSec));
218 }
219 else {
220 return null;
221 }
222 }
223 }
224
225
226
227 /****************************************************************************
228 * report4lcs
229 *
230 * Reporter (printout) class for the class lcs. The Method show in
231 * lcs uses the Methods from report4lcs to print out an given Result
232 */
233 class report4lcs {
234 /****************************************************************
235 *
236 * put_eq($value)
237 *
238 * Function to print an equal Part.
239 */
240 function put_eq($value) {
241 echo nl2br(htmlentities($value));
242 }
243
244 /****************************************************************
245 *
246 * put_diff($fir, $sec)
247 *
248 * Function to print the subsets that are different
249 */
250 function put_diff($fir, $sec) {
251 echo '<span style="background:#ffa;color:red;">';
252 echo nl2br(htmlentities($fir));
253 echo '</span>';
254 echo '<span style="background:#fcf;color:red;">';
255 echo nl2br(htmlentities($sec));
256 echo '</span>';
257 }
258 }
259
260
261
262 /****************************************************************************
263 * reportstorage4lcs
264 *
265 * Reporter class for the class lcs. The Method stores and shows the
266 * results given by lcs.
267 * This class is build to show not HTML-Text.
268 */
269 class reportstorage4lcs extends report4lcs {
270 var $data = array();
271
272 /****************************************************************
273 *
274 * put_eq($value)
275 *
276 * Stores an equal Part.
277 */
278 function put_eq($value) {
279 $this->data[] = $value;
280 }
281
282 /****************************************************************
283 *
284 * put_diff($fir, $sec)
285 *
286 * Stores different subsets
287 */
288 function put_diff($fir, $sec) {
289 $this->data[] = array($fir, $sec);
290 }
291
292 /****************************************************************
293 *
294 * x_nl2br($value, $deco, $enddeco, $nl)
295 *
296 * $value : the value to encode
297 * $deco : code to add decoration for different code
298 * $enddeco : code to terminate decoration
299 * $nl : Symbol to make an NewLine visible
300 *
301 * HTML-Encodes NewLines inkluding a given Symbol
302 */
303 function x_nl2br($value, $deco, $enddeco, $nl) {
304 return str_replace("\n","$nl<br>\n", $value);
305 }
306
307 /****************************************************************
308 *
309 * nl2br($value, $deco, $enddeco, $nl)
310 *
311 * $value : the value to encode
312 * $deco : code to add decoration for different code
313 * $enddeco : code to terminate decoration
314 * $nl : Symbol to make an NewLine visible
315 *
316 * HTML-Encodes NewLines
317 */
318 function nl2br($value, $deco, $enddeco, $nl) {
319 return nl2br($value);
320 }
321
322 /****************************************************************
323 *
324 * encode($value, $deco, $enddeco, $nl)
325 *
326 * $value : the value to encode
327 * $deco : code to add decoration for different code
328 * $enddeco : code to terminate decoration
329 * $nl : Symbol to make an NewLine visible
330 *
331 * HTML-Encodes the given string
332 */
333 function encode($value, $deco, $enddeco, $nl) {
334 return htmlentities($value);
335 }
336
337 /****************************************************************
338 *
339 * getHTML($pos, $deco, $enddeco, $nl)
340 *
341 * $pos : 0 for first, 1 for second
342 * $deco : code to add decoration for different code
343 * $enddeco : code to terminate decoration
344 * $nl : Symbol to make an NewLine visible
345 *
346 * Returns the
347 */
348 function getHTML($pos, $deco, $enddeco, $nl) {
349 $result='';
350 foreach($this->data as $value) {
351 if (is_array($value)) {
352 $puffer = $this->x_nl2br($this->encode($value[$pos],
353 $deco, $enddeco, $nl),
354 $deco, $enddeco, $nl);
355 if ($puffer!='') {
356 $result .= $deco;
357 $result .= $puffer;
358 $result .= $enddeco;
359 }
360 }
361 else {
362 $result .= $this->nl2br($this->encode($value,
363 $deco, $enddeco, $nl),
364 $deco, $enddeco, $nl);
365 }
366 }
367 return $result;
368 }
369
370 /****************************************************************
371 *
372 * Show($deco, $enddeco, $nl, $class)
373 *
374 * $deco : code to add decoration for different code
375 * $enddeco : code to terminate decoration
376 * $nl : Symbol to make an NewLine visible
377 * $class : CSS-Class to use for the HTML-table
378 *
379 * Returns the
380 */
381 function Show($deco = '<span style="background-color:#ffcccc;">',
382 $enddeco='</span>', $nl='↵', $class = '')
383 {
384 echo "<table class='$class'>";
385 echo '<tr><td>';
386 echo $this->getHTML(0, $deco, $enddeco, $nl);
387 echo '</td><td>';
388 echo $this->getHTML(1, $deco, $enddeco, $nl);
389 echo '</td></tr>';
390 echo '</table>';
391 }
392 }
393
394 /****************************************************************************
395 * reportstorage4lcs
396 *
397 * Reporter class for the class lcs. The Method stores and shows the
398 * results given by lcs.
399 * This class is build to show HTML-Text.
400 */
401 class reportstorageHTML4lcs extends reportstorage4lcs {
402 /****************************************************************
403 *
404 * x_nl2br($value, $deco, $enddeco, $nl)
405 *
406 * $value : the value to encode
407 * $deco : code to add decoration for different code
408 * $enddeco : code to terminate decoration
409 * $nl : Symbol to make an NewLine visible
410 *
411 * HTML-Encodes NewLines inkluding a given Symbol
412 */
413 function x_nl2br($value, $deco, $enddeco, $nl) {
414 return str_replace("<br>","$nl<br>",
415 str_replace("<br/>","$nl<br/>", $value));
416 }
417
418 /****************************************************************
419 *
420 * nl2br($value, $deco, $enddeco, $nl)
421 *
422 * $value : the value to encode
423 * $deco : code to add decoration for different code
424 * $enddeco : code to terminate decoration
425 * $nl : Symbol to make an NewLine visible
426 *
427 * HTML-Encodes NewLines
428 */
429 function nl2br($value, $deco, $enddeco, $nl) {
430 return $value;
431 }
432
433 /****************************************************************
434 *
435 * encode($value, $deco, $enddeco, $nl)
436 *
437 * $value : the value to encode
438 * $deco : code to add decoration for different code
439 * $enddeco : code to terminate decoration
440 * $nl : Symbol to make an NewLine visible
441 *
442 * HTML-Encodes the given string
443 */
444 function encode($value, $deco, $enddeco, $nl) {
445 return $value;
446 }
447 }
448
449
450
451 /****************************************************************
452 *
453 * SubArray($arr, $start, $len)
454 *
455 * concatenates a sequence of array-elements to one string
456 */
457 function SubArray($arr, $start, $len = null) {
458 if ($len === 0) {
459 return '';
460 }
461 $puffer = $arr[$start];
462 if ($len < 0) {
463 $ende = $start+$len;
464 if ($ende < 0) {
465 $ende = 0;
466 }
467 for ($i = $start - 1; $i > 0; $i--) {
468 $puffer .= $arr[$i];
469 }
470 }
471 else {
472 if (is_null($len)) {
473 $ende = count($arr);
474 }
475 elseif ($len > 0) {
476 $ende = $start + $len;
477 if ($ende > count($arr)) {
478 $ende = count($arr);
479 }
480 }
481 for ($i = $start + 1; $i < $ende; $i++) {
482 $puffer .= $arr[$i];
483 }
484 }
485 return $puffer;
486 }
487
488
489
490 /****************************************************************
491 *
492 * equal($lvalue, $rvalue)
493 *
494 * return $lvalue == $rvalue
495 */
496 function equal($lvalue, $rvalue) {
497 return $lvalue == $rvalue;
498 }
499
500
501
502 /****************************************************************
503 *
504 * equal($lvalue, $rvalue)
505 *
506 * return true if the trimmed strings are equal
507 */
508 function trimmed_equal($lvalue, $rvalue) {
509 return trim($lvalue) == trim($rvalue);
510 }
511
512 ?>
Herunter Laden von Longest_Common_Subsequence.php
Hier können Sie die Projektdateien zip-Archiv bekommen: Longest_Common_Subsequence.zip.