url-normalization


Categories: ORM Data
GroupId: com.naytev
ArtifactId: url-normalization_2.11
Last Version: 0.3
Type: jar
Description: url-normalization
Project URL: https://github.com/naytev/url-normalization
Project Organization: com.naytev

How to add to project

Maven:

    <!-- https://jarcasting.com/artifacts/com.naytev/url-normalization_2.11/ -->
    <dependency>
        <groupId>com.naytev</groupId>
        <artifactId>url-normalization_2.11</artifactId>
        <version>0.3</version>
    </dependency>

Gradle (Groovy DSL):

    // https://jarcasting.com/artifacts/com.naytev/url-normalization_2.11/
    implementation 'com.naytev:url-normalization_2.11:0.3'

Gradle (Kotlin DSL):

    // https://jarcasting.com/artifacts/com.naytev/url-normalization_2.11/
    implementation("com.naytev:url-normalization_2.11:0.3")

Buildr:

    'com.naytev:url-normalization_2.11:jar:0.3'

Ivy:

    <dependency org="com.naytev" name="url-normalization_2.11" rev="0.3">
      <artifact name="url-normalization_2.11" type="jar" />
    </dependency>

Grape:

    @Grapes(
        @Grab(group='com.naytev', module='url-normalization_2.11', version='0.3')
    )

sbt:

    libraryDependencies += "com.naytev" % "url-normalization_2.11" % "0.3"

Leiningen:

    [com.naytev/url-normalization_2.11 "0.3"]

Dependencies

compile (4)

    Group / Artifact                            Type  Version
    org.scala-lang : scala-library              jar   2.11.4
    log4j : log4j                               jar   1.2.17
    commons-collections : commons-collections   jar   3.2.1
    org.apache.commons : commons-lang3          jar   3.1

test (2)

    Group / Artifact                Type  Version
    com.novocode : junit-interface  jar   0.11
    junit : junit                   jar   4.10

Project Modules

There are no modules declared in this project.

url-normalization

URL normalization (or URL canonicalization) in general is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent. For more detail see http://en.wikipedia.org/wiki/URL_normalization

Rather than offering the traditional kinds of normalization used for SEO purposes, this Java library transforms URLs into comparable, and therefore sortable, keys. You can use it whenever a URL serves as a (primary) key in your application or storage system. The library produces the normalized URL by inverting the order of the domain labels, so a host such as sentric.ch becomes ch.sentric.

Examples

    ch.sentric/blog/berlin-buzzwords-2012-presentation-and-highlights
    ch.sentric/blog/berlin-buzzwords-2012-review-from-a-search-perspective
    ch.sentric/blog/comparing-cloudera-impala
    ch.sentric/blog/cucumber-goes-hadoop
    ch.sentric/blog/ein-treffen-mit-james-kinley-von-cloudera
    ch.sentric/blog/hadoop-best-practice-cluster-checklist
    ch.sentric/blog/hbase-sizing-notes
    ch.sentric/blog/highlights-of-apache-lucene-solr-4-0
    ch.sentric/blog/how-should-pig-and-hive-be-integrated-to-access-data-in-hadoop
    ch.sentric/blog/how-to-determine-hbase-row-sizes
    ch.sentric/blog/log-data-analysis-what-is-the-most-popular-apache-webserver-version
    ch.sentric/blog/monitoring-web-apps-with-cucumber
    ch.sentric/blog/rebuilding-a-solr-index-the-hard-way
    ch.sentric/blog/sentric-at-strata-conference-hadoop-world-2012-in-new-york
    ch.sentric/blog/sentric-becomes-cloudera-connect-partner
    ch.sentric/blog/sentric-speaking-at-apachecon-europe-2012
    ch.sentric/blog/whats-an-appropriate-use-case-for-kafka
    ch.sentric/blog/why-hadoop-and-why-now
    ch.sentric/blog/why-we-chose-solr-4-0-instead-of-elasticsearch
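The label inversion behind keys like the ones above can be sketched in a few lines of Java. This is only an illustration of the idea, not the library's code: the class `HostLabelInverter` and the method `invertHost` are hypothetical names, and the hosts in `main` are made-up inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the library's API.
final class HostLabelInverter {

    // Reverses the dot-separated labels of a host,
    // e.g. "blog.sentric.ch" becomes "ch.sentric.blog".
    static String invertHost(String host) {
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    public static void main(String[] args) {
        // A TreeSet keeps its elements in natural (lexicographic) order.
        TreeSet<String> keys = new TreeSet<>();
        for (String host : new String[] {
                "blog.sentric.ch", "sentric.ch", "example.com", "docs.example.com"}) {
            keys.add(invertHost(host));
        }
        // Prints [ch.sentric, ch.sentric.blog, com.example, com.example.docs]
        System.out.println(keys);
    }
}
```

Because the most significant label comes first, hosts of the same domain end up next to each other when the keys are sorted, which is what makes the inverted form useful as a row key.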

Normalization process

Normalizations that Preserve Semantics

  • Converting the host (and scheme) to lower case: The host (and scheme) components of the URL are case-insensitive. This normalizer will convert them to lowercase. Example: HTTP://www.Example.com/seARch → com.example/search

  • Decoding percent-encoded octets of unreserved characters: For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, they will be decoded to their corresponding unreserved characters by this normalizer. Example: http://www.example.com/%7Eusername/ → com.example/~username/

  • Removing the default port: The default port (port 80 for the “http” scheme) is removed from a URL. Example: http://www.example.com:80/bar.html → com.example/bar.html
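The three semantics-preserving steps above can be sketched with `java.net.URI` as follows. This is a minimal sketch under stated assumptions, not the library's implementation: `PreservingNormalizer`, `normalize` and `decodeUnreserved` are hypothetical names, the input is assumed to be a well-formed http URL with a host, and the result keeps the scheme and the original label order so that each step stays visible. It also leaves the case of the path untouched.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch for illustration; not part of the library's API.
final class PreservingNormalizer {

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Unreserved characters: ALPHA, DIGIT, "-", ".", "_", "~".
    private static boolean isUnreserved(int o) {
        return (o >= 'A' && o <= 'Z') || (o >= 'a' && o <= 'z') || (o >= '0' && o <= '9')
                || o == '-' || o == '.' || o == '_' || o == '~';
    }

    // Decodes %XX sequences only when they encode an unreserved character;
    // every other percent-encoded octet is copied through unchanged.
    static String decodeUnreserved(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                int octet = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                if (isUnreserved(octet)) {
                    out.append((char) octet);
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    static String normalize(String url) {
        URI uri = URI.create(url);
        // Scheme and host are case-insensitive, so lower-casing them is safe.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        // getPort() returns -1 when no port is given; 80 is the default for http.
        int port = uri.getPort();
        boolean defaultPort = port == -1 || ("http".equals(scheme) && port == 80);
        String path = decodeUnreserved(uri.getRawPath());
        String query = uri.getRawQuery();
        return scheme + "://" + host + (defaultPort ? "" : ":" + port)
                + path + (query == null ? "" : "?" + query);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.example.com:80/%7Eusername/"));
    }
}
```

With these definitions, `normalize("http://www.example.com:80/bar.html")` yields `http://www.example.com/bar.html`, and `%7E` in a path is decoded to `~` while an octet such as `%2F` is left encoded.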

Normalizations that Change Semantics

  • Removing “www” as the first domain label: Some websites operate in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting that label from the name of the first. For example, http://example.com/ and http://www.example.com/ may access the same website, and many websites redirect the user from the www to the non-www address or vice versa. This normalizer treats the two forms as equivalent and normalizes all URLs by removing the leading “www” label. Example: http://www.example.com/search → com.example/search

  • Sorting the query parameters: Some web pages use more than one query parameter in the URL. This normalizer can sort the parameters (with their values) into alphabetical order and reassemble the URL. Example: http://www.example.com/display?lang=en&article=fred → com.example/display?article=fred&lang=en

  • Removing the "?" when the query is empty: When the query is empty, the "?" is not needed. Example: http://www.example.com/display? → com.example/display
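The three semantics-changing steps can be sketched in the same way. Again this is a hedged illustration rather than the library's code: `ChangingNormalizer`, `stripWww`, `sortQuery` and `normalize` are hypothetical names, the input is assumed to be a well-formed URL with a host, and the sketch returns host, path and query without inverting the labels.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical sketch for illustration; not part of the library's API.
final class ChangingNormalizer {

    // Drops a leading "www." label from the host, if present.
    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Sorts the "name=value" pairs as plain strings.
    // Returns "" when the query is absent or empty.
    static String sortQuery(String rawQuery) {
        if (rawQuery == null || rawQuery.isEmpty()) {
            return "";
        }
        String[] pairs = rawQuery.split("&");
        Arrays.sort(pairs);
        return String.join("&", pairs);
    }

    static String normalize(String url) {
        URI uri = URI.create(url);
        String query = sortQuery(uri.getRawQuery());
        // The "?" is emitted only when there is a non-empty query to follow it.
        return stripWww(uri.getHost()) + uri.getRawPath()
                + (query.isEmpty() ? "" : "?" + query);
    }

    public static void main(String[] args) {
        // Prints example.com/display?article=fred&lang=en
        System.out.println(normalize("http://www.example.com/display?lang=en&article=fred"));
    }
}
```

Sorting whole `name=value` strings is the simplest choice here; a stricter variant would compare parameter names first and values second.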

Quickstart

  1. Grab the sources from GitHub:

    $ git clone https://github.com/sentric/url-normalization.git
    $ cd url-normalization  
    
  2. Build:

    $ mvn assembly:assembly
    
  3. Test:

    $ mvn test
    

Example Code

    // URL is the library's own class, not java.net.URL (which has no getNormalizedUrl method).
    URL url = new URL("http://www.example.com:80/bar.html");
    url.getNormalizedUrl(); // --> com.example/bar.html

License

url-normalization is released under Apache License Version 2.0, see LICENSE.txt for details.

Versions

Version
0.3