Converting Word Documents to HTML and Extracting Headings
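I recently built a feature that converts Word documents to HTML and extracts the headings to generate a navigation menu. To keep the complexity manageable, the requirement was narrowed to extracting only paragraphs that use the "Heading 1" style.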
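Documents with the .docx extension (Word 2007) use an XML-based file format. A .docx file is essentially a zip archive, and unpacking it exposes all of its content. Perhaps for that reason, XHTMLConverter alone is enough to produce the corresponding HTML document, and in its output the class attribute of each heading element is set to "X" + n, where n is the heading level.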
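To make the "a .docx is a zip archive" point concrete, here is a minimal sketch that lists the entries of such an archive using only the JDK. The class name DocxZipSketch and the helper fakeDocx are hypothetical; fakeDocx builds a small stand-in archive with a few of the part names a real .docx contains, so the example runs without a Word file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxZipSketch {

    /** Lists the entry names of a zip archive held in memory, in stored order. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive laid out like a .docx; it is not a valid Word file. */
    static byte[] fakeDocx() {
        String[] parts = {"[Content_Types].xml", "word/document.xml", "word/styles.xml"};
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String name : parts) {
                out.putNextEntry(new ZipEntry(name));
                out.write("<x/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // For a real document, pass the bytes of the .docx file instead of fakeDocx().
        listEntries(fakeDocx()).forEach(System.out::println);
    }
}
```

In a real .docx, word/document.xml holds the body text and word/styles.xml holds the style definitions, which is where heading styles are declared.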
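.doc documents are more trouble. They are usually read with POI, and the most common way to convert them to HTML is POI's WordToHtmlConverter. That converter gives headings no special treatment: it processes each one as an ordinary styled paragraph (Paragraph), so headings end up mixed in with regular paragraphs. There are two ways to handle this: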
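Option 1: override the processParagraph method and, in the part highlighted in yellow, add a check that detects headings and treats them specially. However, the member variables of WordToHtmlConverter are all declared private, so a subclass cannot access them, and I went with the other option instead.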
protected void processParagraph(HWPFDocumentCore hwpfDocument, Element parentElement, int currentTableLevel, Paragraph paragraph, String bulletText) {
    Element pElement = this.htmlDocumentFacade.createParagraph();
    parentElement.appendChild(pElement);
    StringBuilder style = new StringBuilder();
    WordToHtmlUtils.addParagraphProperties(paragraph, style);
    int charRuns = paragraph.numCharacterRuns();
    if(charRuns != 0) {
        CharacterRun characterRun = paragraph.getCharacterRun(0);
        String pFontName;
        int pFontSize;
        if(characterRun != null) {
            Triplet triplet = this.getCharacterRunTriplet(characterRun);
            pFontSize = characterRun.getFontSize() / 2;
            pFontName = triplet.fontName;
            WordToHtmlUtils.addFontFamily(pFontName, style);
            WordToHtmlUtils.addFontSize(pFontSize, style);
        } else {
            pFontSize = -1;
            pFontName = "";
        }
        this.blocksProperies.push(new WordToHtmlConverter.BlockProperies(pFontName, pFontSize));
        try {
            if(WordToHtmlUtils.isNotEmpty(bulletText)) {
                if(bulletText.endsWith("\t")) {
                    float defaultTab = 720.0F;
                    float firstLinePosition = (float)(paragraph.getIndentFromLeft() + paragraph.getFirstLineIndent() + 20);
                    float nextStop = (float)(Math.ceil((double)(firstLinePosition / 720.0F)) * 720.0D);
                    float spanMinWidth = nextStop - firstLinePosition;
                    Element span = this.htmlDocumentFacade.getDocument().createElement("span");
                    this.htmlDocumentFacade.addStyleClass(span, "s", "display: inline-block; text-indent: 0; min-width: " + spanMinWidth / 1440.0F + "in;");
                    pElement.appendChild(span);
                    Text textNode = this.htmlDocumentFacade.createText(bulletText.substring(0, bulletText.length() - 1) + '\u200b' + ' ');
                    span.appendChild(textNode);
                } else {
                    Text textNode = this.htmlDocumentFacade.createText(bulletText.substring(0, bulletText.length() - 1));
                    pElement.appendChild(textNode);
                }
            }
            this.processCharacters(hwpfDocument, currentTableLevel, paragraph, pElement);
        } finally {
            this.blocksProperies.pop();
        }
        if(style.length() > 0) {
            this.htmlDocumentFacade.addStyleClass(pElement, "p", style.toString());
        }
        WordToHtmlUtils.compactSpans(pElement);
    }
}
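Whichever option is used, the heading check itself comes down to inspecting the paragraph's style name. The sketch below is one possible way to turn a style name into a heading level. It assumes the names look like "heading 1" or the Chinese "标题 1"; the class HeadingStyleSketch and the method headingLevel are hypothetical, and the actual names stored in a given .doc file may differ, so check them before relying on this pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadingStyleSketch {

    // Assumed naming scheme: "heading N" or its Chinese equivalent, followed by a
    // single digit 1-9. The two escaped characters spell the Chinese style name.
    private static final Pattern HEADING = Pattern.compile(
            "^(?:heading|\u6807\u9898)\\s*([1-9])$", Pattern.CASE_INSENSITIVE);

    /** Returns the heading level (1-9), or 0 when the style name is not a heading. */
    static int headingLevel(String styleName) {
        if (styleName == null) {
            return 0;
        }
        Matcher matcher = HEADING.matcher(styleName.trim());
        return matcher.matches() ? Integer.parseInt(matcher.group(1)) : 0;
    }

    public static void main(String[] args) {
        System.out.println(headingLevel("heading 1"));
        System.out.println(headingLevel("Normal"));
    }
}
```

Matching the whole name rather than using a substring test avoids treating unrelated styles whose names happen to contain the word as headings.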
Option 2: plant markers in the Word document, then post-process the converted HTML using idTitleMap.
private Map<String,String> setTitleElements(HWPFDocument wordObject) {
    // Get the style sheet
    StyleSheet styleSheet = wordObject.getStyleSheet();
    int styleTotal = wordObject.getStyleSheet().numStyles();
    // Map each placeholder id to the heading text it replaces
    Map<String,String> idTitleMap = Maps.newHashMap();
    Range range = wordObject.getRange();
    for (int i = 0; i < range.numParagraphs(); i++) {
        // Look up the paragraph's style
        Paragraph paragraph = range.getParagraph(i);
        int styleIndex = paragraph.getStyleIndex();
        if (styleTotal > styleIndex) {
            StyleDescription styleDescription = styleSheet.getStyleDescription(styleIndex);
            String descriptionName = styleDescription.getName();
            if (descriptionName != null && descriptionName.contains(FIRST_LEVEL_TITLE_DESCRIPTION)) {
                // Replace the heading text with a unique id and remember the original text
                String uuid = UUIDHelper.getUuid();
                String text = paragraph.text().replaceAll("[\r\n]", "");
                paragraph.replaceText(uuid, false);
                idTitleMap.put(uuid, text);
            }
        }
    }
    return idTitleMap;
}
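The post-processing step is not shown above, so the following is only a sketch of what it might look like, not the original implementation. It assumes the placeholder ids survive the conversion as plain text in the HTML. The class TitleNavSketch, its Result holder, and the choice of a span with an id attribute as the anchor are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class TitleNavSketch {

    /** Holds the rewritten HTML and the generated navigation list. */
    static final class Result {
        final String html;
        final String nav;

        Result(String html, String nav) {
            this.html = html;
            this.nav = nav;
        }
    }

    /** Escapes the characters that would otherwise be parsed as markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Swaps each placeholder id back to its heading text and builds a nav list. */
    static Result restoreTitles(String html, Map<String, String> idTitleMap) {
        // Keep only ids that made it into the HTML, ordered by position,
        // so the navigation follows document order rather than map order.
        List<String> ids = new ArrayList<>();
        for (String id : idTitleMap.keySet()) {
            if (html.contains(id)) {
                ids.add(id);
            }
        }
        ids.sort(Comparator.comparingInt((String id) -> html.indexOf(id)));

        String out = html;
        StringBuilder nav = new StringBuilder("<ul>");
        for (String id : ids) {
            String title = escape(idTitleMap.get(id));
            // Anchor the heading text so the nav link can jump to it
            out = out.replace(id, "<span id=\"" + id + "\">" + title + "</span>");
            nav.append("<li><a href=\"#").append(id).append("\">")
               .append(title).append("</a></li>");
        }
        nav.append("</ul>");
        return new Result(out, nav.toString());
    }
}
```

Sorting by position matters because idTitleMap is a HashMap, whose iteration order is unrelated to the order of the headings in the document.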